The talk presents selected results of our research in the area of text and data mining in social media and scientific literature. (1) First, we consider the classification of microblogging postings such as tweets on Twitter. Typically, the classification results are evaluated against a gold standard, which is either the hashtags of the tweets’ authors or manual annotations. We claim that there are fundamental differences between these two kinds of gold-standard classifications and conducted an experiment with 163 participants who manually classified tweets from ten topics. Our results show that human annotators are more likely to classify tweets like other human annotators than like the tweets’ authors (i.e., the hashtags). This may influence the evaluation of classification methods such as LDA, and we argue that researchers should reflect on the kind of gold standard used when interpreting their results. (2) Second, we present a framework for semantic document annotation that aims to compare existing as well as new annotation strategies. For entity detection, we compare semantic taxonomies, trigrams, RAKE, and LDA. For concept activation, we cover a set of statistical, hierarchy-based, and graph-based methods. The strategies are evaluated over 100,000 manually labeled scientific documents from economics, politics, and computer science. (3) Finally, we present a processing pipeline for extracting text of varying size, rotation, color, and emphasis from scholarly figures. The pipeline requires neither training nor assumptions about the characteristics of the scholarly figures. We conducted a preliminary evaluation with 121 figures from a broad range of illustration types.
URL: https://www.ukp.tu-darmstadt.de/ukp-home/news-singleview/artikel/guest-speaker-ansgar-scherp/
Knowledge Discovery in Social Media and Scientific Digital Libraries
Slide 1 – Prof. Ansgar Scherp – asc@informatik.uni-kiel.de
Knowledge Discovery in
Social Media and
Scientific Digital Libraries
Ansgar Scherp
Darmstadt, Feb 9, 2016
Thanks to: Chifumi Nishioka, Falk Böschen
Slide 2
KDD Social Media & Digital Libraries
How to deal with the vast amount of content related to
research and innovation?
“The ability to deal with digital information will be a
cultural technique as important as reading and writing.”
Slide 3
KDD Social Media & Digital Libraries
• Examples of current research
1. Classifying tweets
2. Automated subject indexing
3. Extracting text from scholarly figures
• Not covered today
–Schema-extraction from Linked Open Data
–Analysis of evolution of Linked Open Data
Slide 4
Classifying Tweets: Example
To what extent are there fundamental differences between
different approaches to tweet classification?
Author’s hashtag:
(here: none)
Human: #research
#talk #darmstadt
Machine: #talk
#socialmedia
(e.g., [Nishida et al. 12]) (e.g., [Ren et al. 14], [Yang et al. 14])
[NSD15] C. Nishioka, A. Scherp, and K. Dellschaft: Comparing Tweet Classifications by
Authors' Hashtags, Machine Learning, and Human Annotators, WI, Singapore, 2015.
Slide 5
Twitter Dataset: TREC Tweets2011
• Contains about 16 million tweets
• Randomly selected 10 main topics,
each with two sub-topics
• Main topic: hashtag occurs
at least 200 times
     main topic     subtopics
 1   #health        #nutrition, #news
 2   #apple         #iphone, #mac
 3   #photography   #nature, #art
 4   #green         #solar, #eco
 5   #celebrity     #news, #gossip
 6   #fashion       #news, #shoes
 7   #fitness       #health, #exercise
 8   #humor         #quotes, #funny
 9   #quote         #love, #life
10   #travel        #lp, #tips
• 5 classes per topic
• Retrieved 3 tweets per class, i.e., 15 tweets per topic
• Task: classify tweets into groups
Slide 6
Method 1: Hashtag Classifier
• Assign classes to tweets by author’s hashtags
Class ‘#SpendingReview’
Class ‘#TurkeyDayTravel #travel’
Class ‘#TurkeyDayTravel’
Class ‘#travel’
• Multiple hashtags are considered as a single class
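The hashtag classifier can be sketched in a few lines. This is an illustrative reading of the slide (a tweet’s full hashtag set forms one class label), not the authors’ code:

```python
# Sketch of the hashtag classifier: each distinct set of hashtags is treated
# as one class, so '#TurkeyDayTravel #travel' is a class of its own, distinct
# from '#travel' alone. Tweets without hashtags remain unclassified.
def hashtag_class(tweet_hashtags):
    """tweet_hashtags: list of hashtags in a tweet -> class label or None."""
    return " ".join(sorted(tweet_hashtags)) or None

print(hashtag_class(["#TurkeyDayTravel", "#travel"]))  # one combined class
print(hashtag_class([]))                               # no hashtags: None
```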
Slide 7
Method 2: Machine Classifier
• Latent Dirichlet Allocation (LDA) to represent tweets
as probabilities over latent topics [Blei et al. 03]
• Construct the model from TREC Tweets2011
– Train the topic model over tweets aggregated by
their Twitter users [Hong et al. 10]
– Infer a probability distribution over topics for each
of the 15 tweets
• Cluster tweets using k-means
– # of clusters optimized by Hartigan’s index
and Average Silhouette [Kaufman et al. 05]
– Using cosine similarity as a distance measure
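A minimal sketch of this pipeline, assuming scikit-learn and a toy corpus (the talk trains on TREC Tweets2011 with user-level aggregation; the corpus and parameters below are illustrative). Cosine distance is approximated by running k-means on L2-normalized topic vectors:

```python
# Sketch (not the authors' exact setup): represent tweets as LDA topic
# distributions, then cluster with k-means; running k-means on unit-length
# vectors approximates clustering by cosine similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

tweets = [
    "new iphone and mac announced by apple",
    "apple releases iphone update for mac users",
    "morning run and gym workout for health",
    "daily exercise keeps fitness and health up",
]

# Bag-of-words counts (in the paper, tweets are first aggregated per user)
counts = CountVectorizer().fit_transform(tweets)

# Infer a topic distribution per tweet
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)   # shape: (n_tweets, n_topics)

# k-means on L2-normalized vectors ~ cosine-distance clustering
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(normalize(theta))
print(labels)
```

In the study, the number of clusters is not fixed at 2 but optimized via Hartigan’s index and the average silhouette.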
Slide 9
Method 3: Human Classifier
• Annotators can create an arbitrary number of classes
and label them
• Annotators have access to the tweets’ textual content as well as
screenshots of the links, but the hashtag character ‘#’ is removed
Slide 10
Degree of Classifier Agreement
• Methods 1–3 produce groups of tweets
• Compare groups with Cohen’s kappa [Fu et al. 2012]
• Convert classifications into match tables
– Elements in the same group: 1
– Otherwise: 0
• Example: tweets a, b, c, d, e are classified
by classifier 1 into {a, b}, {c, d}, {e}
and by classifier 2 into {a, b, c}, {d, e}
• Compare the match tables using Cohen’s kappa

Classifier 1         Classifier 2
   a  b  c  d           a  b  c  d
b  1                 b  1
c  0  0              c  1  1
d  0  0  1           d  0  0  0
e  0  0  0  0        e  0  0  0  1
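The match-table comparison can be sketched as follows; the two groupings below are read off the match tables on the slide, and Cohen’s kappa treats each tweet pair as one observation rated “same group” or “different group” by the two classifiers:

```python
# Turn each classifier's grouping into a binary "match vector" over all
# tweet pairs (1 = same group), then compare the two vectors with
# Cohen's kappa for two binary raters.
from itertools import combinations

def match_vector(groups, items):
    where = {x: i for i, g in enumerate(groups) for x in g}
    return [1 if where[a] == where[b] else 0 for a, b in combinations(items, 2)]

def cohens_kappa(x, y):
    n = len(x)
    po = sum(1 for a, b in zip(x, y) if a == b) / n    # observed agreement
    p1x, p1y = sum(x) / n, sum(y) / n                  # marginal "same" rates
    pe = p1x * p1y + (1 - p1x) * (1 - p1y)             # chance agreement
    return (po - pe) / (1 - pe)

items = "abcde"
c1 = [{"a", "b"}, {"c", "d"}, {"e"}]        # classifier 1 (slide example)
c2 = [{"a", "b", "c"}, {"d", "e"}]          # classifier 2
kappa = cohens_kappa(match_vector(c1, items), match_vector(c2, items))
print(round(kappa, 2))  # -> 0.09
```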
Slide 11
Agreements Between Classifiers
• Hashtag/Machine (HaM)
– Almost no agreement
– Except topic 3 “photography”:
11 of 15 tweets use the
hashtag also as a word in the text
• Hashtag/Human (HaHu)
– Slight agreement
• Machine/Human (MHu)
– Almost no agreement
– Except topic 10 “travel”:
agreement on the disagreement
for tweets having the hashtag “#tips”
ID    HaM    HaHu   MHu
 1   -0.05   0.12   0.00
 2    0.02   0.05   0.05
 3    0.24   0.06   0.11
 4    0.01   0.11   0.00
 5    0.00   0.07  -0.04
 6    0.00   0.15   0.04
 7    0.04   0.09   0.05
 8   -0.04   0.17   0.03
 9   -0.02   0.13   0.00
10    0.01   0.10   0.45
 M    0.02   0.10   0.07
SD    0.08   0.10   0.12
Slide 12
Inter-Human-Annotator Agreement
• Fleiss’ kappa: measures agreement
among more than two raters
• Consistently observe larger agreement
among human classifiers than for
HaHu and MHu
• The difference is significant
ID   Fleiss’ κ
 1   0.17
 2   0.10
 3   0.13
 4   0.16
 5   0.53
 6   0.20
 7   0.14
 8   0.31
 9   0.33
10   0.38
 M   0.25
SD   0.14
Researchers should use ground truth made by human
annotators rather than hashtags for tweet classification
Slide 13
Automatic Subject Indexing
[GNS15] G. Große-Bölting, C. Nishioka, A. Scherp: A Comparison of Different
Strategies for Automated Semantic Document Annotation. K-CAP 2015
STW (Standard
Thesaurus Wirtschaft)
Cancer (18899-3)
Research (10436-6)
USA (17829-1)
…
Nomination for Best Paper Award at K-CAP 2015
Award „Prof. Dr. Werner Petersen-Preis der Technik 2015”
Published as
Linked Open Data!
Slide 14
Automated Subject Indexing
• Scientific search engine GERHARD (‘97-‘99)
• Ontology with ~10,000 classes in three languages
Slide 15
Experiment Framework
Each strategy is a composition of methods from 1. + 2. + 3.
1. Concept Extraction
detect concepts (candidate annotations) from each document
2. Concept Activation
compute a score for each concept of a document
3. Annotation Selection
select annotations from concepts for each document
4. Evaluation
measure performance of strategies with ground truth
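The framework’s core idea, composing one method per phase into a strategy, can be sketched as follows; all names and toy methods here are illustrative, not the framework’s actual API:

```python
# Hedged sketch of the experiment framework: a strategy is the composition
# of one concept-extraction, one concept-activation, and one
# annotation-selection method. The toy methods below stand in for the real
# ones (entity detection, CF-IDF/HITS/..., top-k/kNN).
def make_strategy(extract, activate, select):
    def strategy(doc):
        concepts = extract(doc)        # 1. concept extraction
        scores = activate(concepts)    # 2. concept activation
        return select(scores)          # 3. annotation selection
    return strategy

extract = lambda doc: doc.lower().split()                     # toy extractor
activate = lambda cs: {c: cs.count(c) for c in set(cs)}       # toy frequency
select = lambda s: sorted(s, key=s.get, reverse=True)[:2]     # toy top-k

strategy = make_strategy(extract, activate, select)
annotations = strategy("tax bank tax crisis")
print(annotations)
```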
Slide 18
Concept Activation Methods
• Concept frequency
• CF-IDF as an extension of the popular TF-IDF model,
replacing terms with concepts [Goossen et al. 11]
– IDF lowers the weight of concepts appearing in many
documents
• These methods do not actually “activate” anything …
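CF-IDF can be sketched as a direct analogue of TF-IDF with detected concepts in place of terms; this is an assumption about the variant used, not necessarily the paper’s exact formulation:

```python
# Sketch of CF-IDF: concept frequency within a document, weighted down by
# the (log-inverse) fraction of documents that contain the concept.
import math

def cf_idf(docs):
    """docs: list of concept lists; returns per-document {concept: score}."""
    n = len(docs)
    df = {}                                   # document frequency per concept
    for d in docs:
        for c in set(d):
            df[c] = df.get(c, 0) + 1
    result = []
    for d in docs:
        s = {}
        for c in set(d):
            cf = d.count(c)                   # concept frequency
            idf = math.log(n / df[c])         # down-weights common concepts
            s[c] = cf * idf
        result.append(s)
    return result

docs = [
    ["Cancer", "Research", "USA", "Research"],   # STW-style concept labels
    ["Research", "Tax"],
]
scores = cf_idf(docs)
print(scores[0])
```

“Research” appears in every document, so its IDF (and hence its CF-IDF score) is zero.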
Slide 19
Hierarchy-based Methods
• Reveal concepts that are
not explicitly mentioned
by using a hierarchical
knowledge base (KB)
• KBs are of high quality and freely available !
[Taxonomy example: “World Wide Web” with sub-concepts
Web Searching, Web Mining, Social Recommendation,
Social Tagging, Site Wrapping, Web Log Analysis]
• Base Activation uses the set of child concepts of a
concept and a decay parameter δ
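One plausible reading of Base Activation, sketched below under the assumption that a concept’s score is its own frequency plus the decayed scores of its children (the slide does not reproduce the exact formula):

```python
# Hedged sketch of hierarchy-based activation: a concept receives its own
# score plus the scores of its child concepts, discounted by a decay
# parameter delta; this formulation is an assumption, not the paper's.
def base_activation(concept, children, score, delta=0.5):
    kids = children.get(concept, [])
    return score.get(concept, 0) + delta * sum(
        base_activation(k, children, score, delta) for k in kids
    )

# toy taxonomy from the slide's example
children = {
    "World Wide Web": ["Web Searching", "Web Mining"],
    "Web Mining": ["Web Log Analysis"],
}
score = {"Web Searching": 2, "Web Mining": 1, "Web Log Analysis": 2}

# "World Wide Web" is never mentioned explicitly, yet gets activated
# through its children:
print(base_activation("World Wide Web", children, score))  # -> 2.0
```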
Slide 20
Hierarchy-based Methods
• One-hop activation
– Developed with domain experts at ZBW
– Maximum activation distance: one hop from the
set of concepts detected in a document
Works very well … why?
Slide 21
Graph-based Methods
• Represent concepts as a
co-occurrence graph, e.g., with nodes
Tax, Bank, Interest Rate,
Financial Crisis, Central Bank
• HITS for link analysis of web sites [Kleinberg 99]:
auth(c) = Σ_{c′→c} hub(c′),  hub(c) = Σ_{c→c′} auth(c′)
• Degree as the number of edges linked with a concept
[Zouaq et al. 12]
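The standard HITS updates [Kleinberg 99] can be sketched by power iteration; how the co-occurrence graph is directed for HITS is not shown on the slide, so the toy edges below are illustrative:

```python
# Power-iteration sketch of HITS: authority = sum of hub scores of in-links,
# hub = sum of authority scores of out-links, normalized each round.
def hits(edges, nodes, iters=50):
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iters):
        auth = {v: sum(hub[u] for u, w in edges if w == v) for v in nodes}
        hub = {v: sum(auth[w] for u, w in edges if u == v) for v in nodes}
        # normalize to keep scores bounded
        na = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        nh = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        auth = {v: a / na for v, a in auth.items()}
        hub = {v: h / nh for v, h in hub.items()}
    return auth, hub

nodes = ["Tax", "Bank", "Interest Rate", "Financial Crisis", "Central Bank"]
edges = [("Tax", "Bank"), ("Tax", "Interest Rate"), ("Bank", "Interest Rate"),
         ("Central Bank", "Interest Rate"), ("Financial Crisis", "Bank")]
auth, hub = hits(edges, nodes)
print(max(auth, key=auth.get))  # -> Interest Rate
```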
Slide 25
Datasets: 3 Scientific Domains
                 Economics      Politics            Computer
Source           ZBW            FIV                 SemEval 2010
# documents      62,924         28,324              244
# annotations    5.26 (±1.84)   12 (±4.02)          5.05 (±2.41)
Knowledge base   STW            European Thesaurus  ACM CCS
# entities       6,335          7,912               2,299
# labels         11,679         8,421               9,086

• Computer science dataset: SemEval 2010 [Kim et al. 10]
• Pre-processing of author keywords needed [Wang et al. 14]
• Total of ~100,000 scientific documents: the largest evaluation so far!
Slide 26
Best Performing Configurations
1. Concept Extraction: Entity, Tri-gram, RAKE, LDA
2. Concept Activation: Statistical methods (2),
Hierarchy-based methods (3), Graph-based methods (3)
3. Annotation Selection: Top-k (2 methods), kNN (1 method)
• Best strategy: Entity × HITS × kNN
• Close ones: OneHop as well as any other graph-based method
Slide 27
[Example figure: bar chart “Preferred Operating System”
(y-axis: Number of Users, N = 20): Windows (16), Linux (3), Mac (1)]
Text Extraction from Scholarly Figures
• Pipeline: Binarization → Clustering → Extraction → OCR → Text
[BS15] F. Böschen, A. Scherp: Multi-oriented Text Extraction from Information
Graphics. DocEng 2015: 35-38
• Fully automated text-extraction (TX) pipeline
• No assumptions, no training
• Novel combination of data mining (DM) and computer vision (CV)
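The pipeline’s first stage, binarization, can be sketched with Otsu’s method in plain NumPy; the talk does not specify which binarization algorithm the pipeline actually uses, so this is an illustrative stand-in:

```python
# Otsu binarization sketch: pick the threshold that maximizes between-class
# variance of the grayscale histogram, then keep dark pixels as foreground.
import numpy as np

def otsu_binarize(gray):
    """gray: 2-D uint8 array; returns a boolean mask of foreground pixels."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = hist[:t].sum(), hist[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (hist[:t] * np.arange(t)).sum() / w0
        m1 = (hist[t:] * np.arange(t, 256)).sum() / w1
        var = w0 * w1 * (m0 - m1) ** 2   # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return gray < best_t                  # dark pixels = text foreground

# synthetic "figure": dark text pixels (20) on a light background (230)
img = np.full((10, 10), 230, dtype=np.uint8)
img[4:6, 2:8] = 20
mask = otsu_binarize(img)
print(int(mask.sum()))  # -> 12 foreground pixels
```

Subsequent stages would cluster the foreground pixels into text regions, estimate their orientation, and hand the rectified regions to an OCR engine.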
Slide 28
Challenges for Research
• Different font sizes
• … font colors
• … background colors
• … emphases
• Different angles
• Overlapping elements
Slide 29
121 Scholarly Figures in Economics
(from ZBW Open Access Corpus)
Current results: improvement of text
recognition over the baseline (BL) by up to 30%
Slide 30
Evaluation Setup
• How to match the output (left) with the gold standard (right)?

           Output “item 1”    Gold standard “Item 1”
Unigrams   {e, i, m, t, 1}    {e, m, t, I, 1}
Bigrams    {em, it, te}       {em, te, It}
Trigrams   {ite, tem}         {tem, Ite}
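One way to answer the matching question is to score the overlap of the per-token character n-gram sets with an F1 measure, which tolerates small OCR and case errors better than exact string match; the F1 scoring is my suggestion, not necessarily the evaluation actually used:

```python
# Compare extracted text with the gold standard via character n-gram set
# overlap, computed per whitespace token (matching the sets on the slide,
# where "1" contributes no bigrams or trigrams).
def ngrams(text, n):
    grams = set()
    for tok in text.split():
        grams |= {tok[i:i + n] for i in range(len(tok) - n + 1)}
    return grams

def ngram_f1(output, gold, n):
    a, b = ngrams(output, n), ngrams(gold, n)
    if not a or not b:
        return 0.0
    overlap = len(a & b)
    p, r = overlap / len(a), overlap / len(b)
    return 2 * p * r / (p + r) if p + r else 0.0

# slide example: OCR output "item 1" vs. gold standard "Item 1"
for n in (1, 2, 3):
    print(n, round(ngram_f1("item 1", "Item 1", n), 2))
```

Larger n penalizes the case mismatch more heavily: the unigram F1 is 0.8, while the trigram F1 drops to 0.5.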
Slide 31
Limits of Current Evaluation
• Baseline #1: OCR engine Tesseract (Google)
with layout analysis; one pass per figure
• Baseline #2: OCR engine Tesseract (Google)
with layout analysis; multiple, angle-rotated passes
Comparison with related work: very difficult!
Slide 33
Mockup: Use of TX in ZBW’s EconBiz
Slide 34
Summary: KDD in Social Media & DL
How to deal with the vast amount of content related to
research and innovation?
• H2020 INSO-4 project, duration: 04/2016-03/2019
• Platform with data mining and visualization tools for
enabling information professionals to deal with large
corpora of scientific content, data, and social media