The talk presents selected results of our research in the area of text and data mining in social media and scientific literature. (1) First, we consider the classification of microblogging postings such as tweets on Twitter. Typically, the classification results are evaluated against a gold standard, which is either the hashtags of the tweets’ authors or manual annotations. We claim that there are fundamental differences between these two kinds of gold-standard classifications and conducted an experiment with 163 participants who manually classified tweets from ten topics. Our results show that human annotators are more likely to classify tweets like other human annotators than like the tweets’ authors (i.e., the hashtags). This may influence the evaluation of classification methods such as LDA, and we argue that researchers should reflect on the kind of gold standard used when interpreting their results. (2) Second, we present a framework for semantic document annotation that aims to compare existing as well as new annotation strategies. For entity detection, we compare semantic taxonomies, trigrams, RAKE, and LDA. For concept activation, we cover a set of statistical, hierarchy-based, and graph-based methods. The strategies are evaluated over 100,000 manually labeled scientific documents from economics, politics, and computer science. (3) Finally, we present a processing pipeline for extracting text of varying size, rotation, color, and emphasis from scholarly figures. The pipeline requires neither training nor assumptions about the characteristics of the scholarly figures. We conducted a preliminary evaluation with 121 figures from a broad range of illustration types.
URL: https://www.ukp.tu-darmstadt.de/ukp-home/news-singleview/artikel/guest-speaker-ansgar-scherp/
Knowledge Discovery in Social Media and Scientific Digital Libraries
Slide 1 – Prof. Ansgar Scherp – asc@informatik.uni-kiel.de
Knowledge Discovery in
Social Media and
Scientific Digital Libraries
Ansgar Scherp
Darmstadt, Feb 9, 2016
Thanks to: Chifumi Nishioka, Falk Böschen
Slide 2
KDD Social Media & Digital Libraries
How to deal with the vast amount of content related to
research and innovation?
“The ability to deal with digital information will be a
cultural technique as important as reading and writing.”
Slide 3
KDD Social Media & Digital Libraries
• Examples of current research
1. Classifying tweets
2. Automated subject indexing
3. Extracting text from scholarly figures
• Not covered today
–Schema-extraction from Linked Open Data
–Analysis of evolution of Linked Open Data
Slide 4
Classifying Tweets: Example
To what extent are there fundamental differences between
different approaches to tweet classification?
Author’s hashtag:
(here: none)
Human: #research
#talk #darmstadt
Machine: #talk
#socialmedia
(e.g., [Nishida et al. 12]) (e.g., [Ren et al. 14], [Yang et al. 14])
[NSD15] C. Nishioka, A. Scherp, and K. Dellschaft: Comparing Tweet Classifications by
Authors' Hashtags, Machine Learning, and Human Annotators, WI, Singapore, 2015.
Slide 5
Twitter Dataset: TREC Tweets2011
• Contains about 16 million tweets
• Randomly selected 10 main topics,
each with two sub-topics
• Main topic: hashtag occurs
at least 200 times
     main topic     subtopics
 1   #health        #nutrition, #news
 2   #apple         #iphone, #mac
 3   #photography   #nature, #art
 4   #green         #solar, #eco
 5   #celebrity     #news, #gossip
 6   #fashion       #news, #shoes
 7   #fitness       #health, #exercise
 8   #humor         #quotes, #funny
 9   #quote         #love, #life
10   #travel        #lp, #tips
• 5 classes per topic
• Retrieved 3 tweets per class, i.e., 15 tweets per topic
• Task: classify tweets into groups
Slide 6
Method 1: Hashtag Classifier
• Assign classes to tweets by author’s hashtags
Class ‘#SpendingReview’
Class ‘#TurkeyDayTravel #travel’
Class ‘#TurkeyDayTravel’
Class ‘#travel’
• Multiple hashtags are considered as a single class
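The hashtag classifier can be sketched in a few lines. This is an illustrative reading of the slide (a tweet’s full hashtag set forms one class label), not the authors’ code:

```python
# Sketch of the hashtag classifier: each distinct set of hashtags is treated
# as one class, so '#TurkeyDayTravel #travel' is a class of its own, distinct
# from '#travel' alone. Tweets without hashtags remain unclassified.
def hashtag_class(tweet_hashtags):
    """tweet_hashtags: list of hashtags in a tweet -> class label or None."""
    return " ".join(sorted(tweet_hashtags)) or None

print(hashtag_class(["#TurkeyDayTravel", "#travel"]))  # one combined class
print(hashtag_class([]))                               # no hashtags: None
```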
Slide 7
Method 2: Machine Classifier
• Latent Dirichlet Allocation (LDA) to represent tweets
as probabilities over latent topics [Blei et al. 03]
• Construct the model from TREC Tweets2011
– Train the topic model over tweets aggregated by
their Twitter users [Hong et al. 10]
– Infer a probability distribution over topics for each
of the 15 tweets
• Cluster tweets using k-means
– # of clusters optimized by Hartigan’s index
and Average Silhouette [Kaufman et al. 05]
– Using cosine similarity as a distance measure
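A minimal sketch of this pipeline, assuming scikit-learn and a toy corpus (the talk trains on TREC Tweets2011 with user-level aggregation; the corpus and parameters below are illustrative). Cosine distance is approximated by running k-means on L2-normalized topic vectors:

```python
# Sketch (not the authors' exact setup): represent tweets as LDA topic
# distributions, then cluster with k-means; running k-means on unit-length
# vectors approximates clustering by cosine similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

tweets = [
    "new iphone and mac announced by apple",
    "apple releases iphone update for mac users",
    "morning run and gym workout for health",
    "daily exercise keeps fitness and health up",
]

# Bag-of-words counts (in the paper, tweets are first aggregated per user)
counts = CountVectorizer().fit_transform(tweets)

# Infer a topic distribution per tweet
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)   # shape: (n_tweets, n_topics)

# k-means on L2-normalized vectors ~ cosine-distance clustering
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(normalize(theta))
print(labels)
```

In the study, the number of clusters is not fixed at 2 but optimized via Hartigan’s index and the average silhouette.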
Slide 9
Method 3: Human Classifier
• Annotators can create an arbitrary number of classes
and label them
• Annotators have access to the tweets’ textual content as well as
screenshots of the links, but the hashtag character ‘#’ is removed
Slide 10
Degree of Classifier Agreement
• Methods 1–3 produce groups of tweets
• Compare groups with Cohen’s kappa [Fu et al. 2012]
• Convert classifications into match tables
– Elements in the same group: 1
– Otherwise: 0
• Example: tweets a, b, c, d, e are classified
by classifier 1 into {a, b}, {c, d}, {e}
and by classifier 2 into {a, b, c}, {d, e}
• Compare the match tables using Cohen’s kappa

Classifier 1         Classifier 2
   a  b  c  d           a  b  c  d
b  1                 b  1
c  0  0              c  1  1
d  0  0  1           d  0  0  0
e  0  0  0  0        e  0  0  0  1
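The match-table comparison can be sketched as follows; the two groupings below are read off the match tables on the slide, and Cohen’s kappa treats each tweet pair as one observation rated “same group” or “different group” by the two classifiers:

```python
# Turn each classifier's grouping into a binary "match vector" over all
# tweet pairs (1 = same group), then compare the two vectors with
# Cohen's kappa for two binary raters.
from itertools import combinations

def match_vector(groups, items):
    where = {x: i for i, g in enumerate(groups) for x in g}
    return [1 if where[a] == where[b] else 0 for a, b in combinations(items, 2)]

def cohens_kappa(x, y):
    n = len(x)
    po = sum(1 for a, b in zip(x, y) if a == b) / n    # observed agreement
    p1x, p1y = sum(x) / n, sum(y) / n                  # marginal "same" rates
    pe = p1x * p1y + (1 - p1x) * (1 - p1y)             # chance agreement
    return (po - pe) / (1 - pe)

items = "abcde"
c1 = [{"a", "b"}, {"c", "d"}, {"e"}]        # classifier 1 (slide example)
c2 = [{"a", "b", "c"}, {"d", "e"}]          # classifier 2
kappa = cohens_kappa(match_vector(c1, items), match_vector(c2, items))
print(round(kappa, 2))  # -> 0.09
```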
Slide 11
Agreements Between Classifiers
• Hashtag/Machine (HaM)
– Almost no agreement
– Except topic 3 “photography”:
11 of 15 tweets use the
hashtag also as a word in the text
• Hashtag/Human (HaHu)
– Slight agreement
• Machine/Human (MHu)
– Almost no agreement
– Except topic 10 “travel”:
agreement on the disagreement
for tweets having the hashtag “#tips”
ID    HaM    HaHu   MHu
 1   -0.05   0.12   0.00
 2    0.02   0.05   0.05
 3    0.24   0.06   0.11
 4    0.01   0.11   0.00
 5    0.00   0.07  -0.04
 6    0.00   0.15   0.04
 7    0.04   0.09   0.05
 8   -0.04   0.17   0.03
 9   -0.02   0.13   0.00
10    0.01   0.10   0.45
 M    0.02   0.10   0.07
SD    0.08   0.10   0.12
Slide 12
Inter-Human-Annotator Agreement
• Fleiss’ kappa: measures agreement
among more than two raters
• Consistently observe larger agreement
among human classifiers than for
HaHu and MHu
• The difference is significant
ID   Fleiss’ κ
 1   0.17
 2   0.10
 3   0.13
 4   0.16
 5   0.53
 6   0.20
 7   0.14
 8   0.31
 9   0.33
10   0.38
 M   0.25
SD   0.14
Researchers should use ground truth made by human
annotators rather than hashtags for tweet classification
Slide 13
Automatic Subject Indexing
[GNS15] G. Große-Bölting, C. Nishioka, A. Scherp: A Comparison of Different
Strategies for Automated Semantic Document Annotation. K-CAP 2015
STW (Standard
Thesaurus Wirtschaft)
Cancer (18899-3)
Research (10436-6)
USA (17829-1)
…
Nomination for Best Paper Award at K-CAP 2015
Award „Prof. Dr. Werner Petersen-Preis der Technik 2015”
Published as
Linked Open Data!
Slide 14
Automated Subject Indexing
• Scientific search engine GERHARD (‘97-‘99)
• Ontology with ~10,000 classes in three languages
Slide 15
Experiment Framework
Each strategy is a composition of methods from 1. + 2. + 3.
1. Concept Extraction
detect concepts (candidate annotations) from each document
2. Concept Activation
compute a score for each concept of a document
3. Annotation Selection
select annotations from concepts for each document
4. Evaluation
measure performance of strategies with ground truth
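The framework’s core idea, composing one method per phase into a strategy, can be sketched as follows; all names and toy methods here are illustrative, not the framework’s actual API:

```python
# Hedged sketch of the experiment framework: a strategy is the composition
# of one concept-extraction, one concept-activation, and one
# annotation-selection method. The toy methods below stand in for the real
# ones (entity detection, CF-IDF/HITS/..., top-k/kNN).
def make_strategy(extract, activate, select):
    def strategy(doc):
        concepts = extract(doc)        # 1. concept extraction
        scores = activate(concepts)    # 2. concept activation
        return select(scores)          # 3. annotation selection
    return strategy

extract = lambda doc: doc.lower().split()                     # toy extractor
activate = lambda cs: {c: cs.count(c) for c in set(cs)}       # toy frequency
select = lambda s: sorted(s, key=s.get, reverse=True)[:2]     # toy top-k

strategy = make_strategy(extract, activate, select)
annotations = strategy("tax bank tax crisis")
print(annotations)
```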
Slide 18
Concept Activation Methods
• Concept frequency
• CF-IDF as an extension of the popular TF-IDF model,
replacing terms with concepts [Goossen et al. 11]
– IDF lowers the weight of concepts appearing in many
documents
• These methods do not actually “activate” anything …
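CF-IDF can be sketched as a direct analogue of TF-IDF with detected concepts in place of terms; this is an assumption about the variant used, not necessarily the paper’s exact formulation:

```python
# Sketch of CF-IDF: concept frequency within a document, weighted down by
# the (log-inverse) fraction of documents that contain the concept.
import math

def cf_idf(docs):
    """docs: list of concept lists; returns per-document {concept: score}."""
    n = len(docs)
    df = {}                                   # document frequency per concept
    for d in docs:
        for c in set(d):
            df[c] = df.get(c, 0) + 1
    result = []
    for d in docs:
        s = {}
        for c in set(d):
            cf = d.count(c)                   # concept frequency
            idf = math.log(n / df[c])         # down-weights common concepts
            s[c] = cf * idf
        result.append(s)
    return result

docs = [
    ["Cancer", "Research", "USA", "Research"],   # STW-style concept labels
    ["Research", "Tax"],
]
scores = cf_idf(docs)
print(scores[0])
```

“Research” appears in every document, so its IDF (and hence its CF-IDF score) is zero.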
Slide 19
Hierarchy-based Methods
• Reveal concepts that are
not explicitly mentioned
by using a hierarchical
knowledge base (KB)
• KBs are of high quality and freely available !
[Taxonomy example: “World Wide Web” with sub-concepts
Web Searching, Web Mining, Social Recommendation,
Social Tagging, Site Wrapping, Web Log Analysis]
• Base Activation uses the set of child concepts of a
concept and a decay parameter δ
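One plausible reading of Base Activation, sketched below under the assumption that a concept’s score is its own frequency plus the decayed scores of its children (the slide does not reproduce the exact formula):

```python
# Hedged sketch of hierarchy-based activation: a concept receives its own
# score plus the scores of its child concepts, discounted by a decay
# parameter delta; this formulation is an assumption, not the paper's.
def base_activation(concept, children, score, delta=0.5):
    kids = children.get(concept, [])
    return score.get(concept, 0) + delta * sum(
        base_activation(k, children, score, delta) for k in kids
    )

# toy taxonomy from the slide's example
children = {
    "World Wide Web": ["Web Searching", "Web Mining"],
    "Web Mining": ["Web Log Analysis"],
}
score = {"Web Searching": 2, "Web Mining": 1, "Web Log Analysis": 2}

# "World Wide Web" is never mentioned explicitly, yet gets activated
# through its children:
print(base_activation("World Wide Web", children, score))  # -> 2.0
```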
Slide 20
Hierarchy-based Methods
• One-hop activation
– Developed with domain experts at ZBW
– Maximum activation distance: one hop from the
set of concepts detected in a document
Works very well … why?
Slide 21
Graph-based Methods
• Represent concepts as a
co-occurrence graph, e.g., with nodes
Tax, Bank, Interest Rate,
Financial Crisis, Central Bank
• HITS for link analysis of web sites [Kleinberg 99]:
auth(c) = Σ_{c′→c} hub(c′),  hub(c) = Σ_{c→c′} auth(c′)
• Degree as the number of edges linked with a concept
[Zouaq et al. 12]
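The standard HITS updates [Kleinberg 99] can be sketched by power iteration; how the co-occurrence graph is directed for HITS is not shown on the slide, so the toy edges below are illustrative:

```python
# Power-iteration sketch of HITS: authority = sum of hub scores of in-links,
# hub = sum of authority scores of out-links, normalized each round.
def hits(edges, nodes, iters=50):
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iters):
        auth = {v: sum(hub[u] for u, w in edges if w == v) for v in nodes}
        hub = {v: sum(auth[w] for u, w in edges if u == v) for v in nodes}
        # normalize to keep scores bounded
        na = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        nh = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        auth = {v: a / na for v, a in auth.items()}
        hub = {v: h / nh for v, h in hub.items()}
    return auth, hub

nodes = ["Tax", "Bank", "Interest Rate", "Financial Crisis", "Central Bank"]
edges = [("Tax", "Bank"), ("Tax", "Interest Rate"), ("Bank", "Interest Rate"),
         ("Central Bank", "Interest Rate"), ("Financial Crisis", "Bank")]
auth, hub = hits(edges, nodes)
print(max(auth, key=auth.get))  # -> Interest Rate
```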
Slide 25
Datasets: 3 Scientific Domains
                 Economics      Politics            Computer
Source           ZBW            FIV                 SemEval 2010
# documents      62,924         28,324              244
# annotations    5.26 (±1.84)   12 (±4.02)          5.05 (±2.41)
Knowledge base   STW            European Thesaurus  ACM CCS
# entities       6,335          7,912               2,299
# labels         11,679         8,421               9,086

• Computer science dataset: SemEval 2010 [Kim et al. 10]
• Pre-processing of author keywords needed [Wang et al. 14]
• Total of ~100,000 scientific documents: the largest evaluation so far!
Slide 26
Best Performing Configurations
1. Concept Extraction: Entity, Tri-gram, RAKE, LDA
2. Concept Activation: Statistical methods (2),
Hierarchy-based methods (3), Graph-based methods (3)
3. Annotation Selection: Top-k (2 methods), kNN (1 method)
• Best strategy: Entity × HITS × kNN
• Close ones: OneHop as well as any other graph-based method
Slide 27
[Example figure: bar chart “Preferred Operating System”
(y-axis: Number of Users, N = 20): Windows (16), Linux (3), Mac (1)]
Text Extraction from Scholarly Figures
• Pipeline: Binarization → Clustering → Extraction → OCR → Text
[BS15] F. Böschen, A. Scherp: Multi-oriented Text Extraction from Information
Graphics. DocEng 2015: 35-38
• Fully automated text-extraction (TX) pipeline
• No assumptions, no training
• Novel combination of data mining (DM) and computer vision (CV)
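The pipeline’s first stage, binarization, can be sketched with Otsu’s method in plain NumPy; the talk does not specify which binarization algorithm the pipeline actually uses, so this is an illustrative stand-in:

```python
# Otsu binarization sketch: pick the threshold that maximizes between-class
# variance of the grayscale histogram, then keep dark pixels as foreground.
import numpy as np

def otsu_binarize(gray):
    """gray: 2-D uint8 array; returns a boolean mask of foreground pixels."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = hist[:t].sum(), hist[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (hist[:t] * np.arange(t)).sum() / w0
        m1 = (hist[t:] * np.arange(t, 256)).sum() / w1
        var = w0 * w1 * (m0 - m1) ** 2   # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return gray < best_t                  # dark pixels = text foreground

# synthetic "figure": dark text pixels (20) on a light background (230)
img = np.full((10, 10), 230, dtype=np.uint8)
img[4:6, 2:8] = 20
mask = otsu_binarize(img)
print(int(mask.sum()))  # -> 12 foreground pixels
```

Subsequent stages would cluster the foreground pixels into text regions, estimate their orientation, and hand the rectified regions to an OCR engine.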
Slide 28
Challenges for Research
• Different font sizes
• … font colors
• … background colors
• … emphases
• Different angles
• Overlapping elements
Slide 29
121 Scholarly Figures in Economics
(from ZBW Open Access Corpus)
Current results: improvement of text
recognition over the baseline (BL) by up to 30%
Slide 30
Evaluation Setup
• How to match the output (left) with the gold standard (right)?

           Output “item 1”    Gold standard “Item 1”
Unigrams   {e, i, m, t, 1}    {e, m, t, I, 1}
Bigrams    {em, it, te}       {em, te, It}
Trigrams   {ite, tem}         {tem, Ite}
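One way to answer the matching question is to score the overlap of the per-token character n-gram sets with an F1 measure, which tolerates small OCR and case errors better than exact string match; the F1 scoring is my suggestion, not necessarily the evaluation actually used:

```python
# Compare extracted text with the gold standard via character n-gram set
# overlap, computed per whitespace token (matching the sets on the slide,
# where "1" contributes no bigrams or trigrams).
def ngrams(text, n):
    grams = set()
    for tok in text.split():
        grams |= {tok[i:i + n] for i in range(len(tok) - n + 1)}
    return grams

def ngram_f1(output, gold, n):
    a, b = ngrams(output, n), ngrams(gold, n)
    if not a or not b:
        return 0.0
    overlap = len(a & b)
    p, r = overlap / len(a), overlap / len(b)
    return 2 * p * r / (p + r) if p + r else 0.0

# slide example: OCR output "item 1" vs. gold standard "Item 1"
for n in (1, 2, 3):
    print(n, round(ngram_f1("item 1", "Item 1", n), 2))
```

Larger n penalizes the case mismatch more heavily: the unigram F1 is 0.8, while the trigram F1 drops to 0.5.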
Slide 31
Limits of Current Evaluation
• Baseline #1: OCR engine Tesseract (Google)
with layout analysis; one pass per figure
• Baseline #2: OCR engine Tesseract (Google)
with layout analysis; multiple, angle-rotated passes
Comparison with related work: very difficult!
Slide 33
Mockup: Use of TX in ZBW’s EconBiz
Slide 34
Summary: KDD in Social Media & DL
How to deal with the vast amount of content related to
research and innovation?
• H2020 INSO-4 project, duration: 04/2016-03/2019
• Platform with data mining and visualization tools for
enabling information professionals to deal with large
corpora of scientific content, data, and social media