Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web
1. INSIGHT Centre for Data Analytics
www.insight-centre.org
Characterising concepts of interest
leveraging Linked Data
and the Social Web
Fabrizio Orlandi, Pavan Kapanipathi,
Amit Sheth, Alexandre Passant
IEEE/WIC/ACM Web Intelligence
Atlanta, GA, USA
20th November 2013
Copyright 2013 INSIGHT Centre for Data Analytics. All rights reserved.
Semantic Web & Linked Data
Research Programme
2. Scenario:
Personalisation and User Profiling on the Social Web
http://www.flickr.com/photos/giladlotan/
5. Solution
Interlink social websites
Integration & User Modelling
Merge and model user data
Personalise users’ experience using their profile
User Profile
Recommendations
Adaptive Systems
Search Personalisation
[Orlandi et al., I-Semantics 2012]
6. Problem
Entity-based user profiles of interests:
Sport
CEV Volleyball Cup
Music
Heavy Metal
Mastodon
Atlanta
…
7. Problem
Entity-based user profiles of interests:
Semantics?
Pragmatics?
Sport
CEV Volleyball Cup
Music
Heavy Metal
Mastodon
Relevance?
Atlanta
…
8. Linking Open Data
The Semantics of the Web of Data
LOD Cloud by R. Cyganiak and A. Jentzsch
9. Example
“Mastodon is the best heavy metal band from Atlanta…
Can’t wait to see them live again!”
“Trentino vs Lugano about to start - Diatec youngster to
impress again in CEV Champions League #volleyball”
“W3C Invites Implementations of five Candidate
Recommendations for RDF 1.1 #SemanticWeb”
• Named entity recognition and disambiguation
• Frequency + time-decay weighting scheme
Music
Heavy Metal
Mastodon
Atlanta
CEV Champions League
Volleyball
Semantic Web
RDF
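The frequency + time-decay weighting scheme mentioned on this slide could look roughly like the sketch below. The slides do not give the decay function, so the exponential decay, the 30-day half-life and the function name `interest_weights` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def interest_weights(mentions, now, half_life_days=30.0):
    """Sum one decayed contribution per mention of each entity.

    mentions: iterable of (entity, day) pairs, with day as a float.
    The exponential decay and the half-life are assumptions, not the
    authors' documented scheme.
    """
    decay_rate = math.log(2) / half_life_days
    weights = defaultdict(float)
    for entity, day in mentions:
        age = max(0.0, now - day)  # mentions in the future count as age 0
        weights[entity] += math.exp(-decay_rate * age)
    return dict(weights)


# Hypothetical mentions extracted from a user's posts.
mentions = [("Mastodon", 100.0), ("Mastodon", 70.0), ("RDF", 40.0)]
weights = interest_weights(mentions, now=100.0)
```

With a 30-day half-life, a mention from today contributes 1.0, one from 30 days ago contributes 0.5, and one from 60 days ago contributes 0.25, so frequent and recent entities rise to the top of the profile.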
10. Example
Are all the extracted entities useful for personalisation?
How are concepts/entities being used on the Social Web? (Pragmatics)
Music: very abstract, very popular
Heavy Metal: very popular
Mastodon (band): specific and time-dependent on events, etc.
Atlanta (GA.): specific, very popular and time-dependent
CEV Champions League: specific and time-dependent on events, etc.
Volleyball: abstract and popular
Semantic Web: abstract and not popular
RDF: specific and not popular
11. The Dimensions of our Characterisation
Specificity
The level of abstraction that an entity has in a common
conceptual schema shared by humans
Popularity
How popular an entity is on the Social Web
– How frequently is it mentioned/used at that point of time?
Temporal Dynamics
The trend and evolution of the frequency of mentions of an
entity on the Social Web
– i.e. popularity over time
12. Requirements
Our use case: real-time personalisation of Social Web streams
1. (Quasi-)real-time computation of the dimensions
2. Results constantly up to date with the real world
3. Knowledge base and domain independent approach
13. Popularity
We chose the Twitter Search API
We search for an entity on the Twitter stream in a short recent time
frame.
Run entity disambiguation on the resulting tweets to filter out noisy
tweets.
Count the remaining tweets in a given timeframe.
The Popularity measure is the resulting value in tweets/second.
This is fast, simple and up to date, but it only covers a short, recent time frame.
e.g. “Music” ~ 16.6 tw/s
“Heavy Metal” ~ 0.09 tw/s
“Semantic Web” ~ 0.0008 tw/s
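The popularity measure described above reduces to a count divided by the length of the time window. A minimal sketch follows, assuming the tweets have already been retrieved and filtered by entity disambiguation; the function name `popularity` and the sample timestamps are hypothetical, and no Twitter API call is made here.

```python
def popularity(tweet_times, window_start, window_end):
    """Return tweets per second inside [window_start, window_end).

    tweet_times: timestamps (in seconds) of tweets that survived
    entity disambiguation. Fetching and filtering are out of scope.
    """
    if window_end <= window_start:
        raise ValueError("window_end must be later than window_start")
    count = sum(1 for t in tweet_times if window_start <= t < window_end)
    return count / (window_end - window_start)


# Hypothetical timestamps: four of the five tweets fall in a 60 s window.
times = [3.0, 10.5, 42.0, 59.9, 75.0]
rate = popularity(times, 0.0, 60.0)
```

Because only a short, recent window is counted, the value tracks the current state of the stream but says nothing about longer-term trends.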
14. Temporal Dynamics
We use Wikipedia page views
Entities are already mapped to DBpedia
MediaWiki API provides a long history of daily page views of
Wikipedia articles
We use Mean and Standard Deviation for the last 30 days of page
views to identify if the popularity of an entity is:
– Stable/Unstable
– Trendy/Non-Trendy
Example page-view charts: CEV_Champions_League and Typhoon_Haiyan (2013)
(Diagrams from: stats.grok.se)
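The slide names the inputs (mean and standard deviation of the last 30 days of page views) but not the decision rules. The sketch below is one possible reading: the coefficient-of-variation threshold, the "latest day above mean plus two standard deviations" rule and the name `temporal_dynamics` are assumptions, not the authors' documented criteria.

```python
from statistics import mean, pstdev


def temporal_dynamics(daily_views, cv_threshold=0.5, trend_factor=2.0):
    """Label an entity as stable/unstable and trendy/non-trendy.

    daily_views: daily page-view counts, oldest first.
    Both thresholds are illustrative assumptions.
    """
    window = daily_views[-30:]  # last 30 days only
    mu = mean(window)
    sigma = pstdev(window)
    unstable = mu > 0 and sigma / mu > cv_threshold
    trendy = window[-1] > mu + trend_factor * sigma
    return {"mean": mu, "std": sigma, "stable": not unstable, "trendy": trendy}


# Hypothetical series: a flat one and one with a spike on the last day.
flat = temporal_dynamics([100] * 30)
spike = temporal_dynamics([100] * 29 + [5000])
```

A flat series comes out as stable and non-trendy, while a sudden spike on the most recent day (as one would expect around an event such as a typhoon or a match) comes out as unstable and trendy.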
15. Specificity
We use the Linking Open Data (LOD) cloud
Most of the available knowledge bases (e.g. DMOZ, Wordnet,
OpenCyc) are not up-to-date.
Wikipedia would be large, domain-independent, continuously
updated, but:
– entities are not organised hierarchically in a taxonomy
– we cannot use taxonomy-based methods (i.e. super-/sub-type relations)
– plus: expensive algorithms would not be suitable for real-time computation
LOD link structure!
16. Graph based measures
State-of-the-art (SOA) graph-based method:
indegree and outdegree
(here called Incoming/Outgoing Predicates, IP and OP)
We can use these methods with RDF triples
We introduce “distinct in/out-degree” (IDP and ODP)
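Example graph around node “m”:
incoming edges from s1, s2 and s3, labelled with predicates p1, p1 and p2
outgoing edges to o1 and o2, labelled with predicates p3 and p4
Values for “m”:
IP (indegree) = 3
OP (outdegree) = 2
IDP (distinct indegree) = 2
ODP (distinct outdegree) = 2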
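The four values for node “m” can be reproduced from a plain list of triples. This is a minimal sketch: the function name `degree_measures` is hypothetical, and the assignment of predicate p2 to subject s3 is an assumption about the diagram (it does not affect the counts).

```python
def degree_measures(triples, node):
    """Count incoming/outgoing predicates of a node, total and distinct."""
    incoming = [p for s, p, o in triples if o == node]
    outgoing = [p for s, p, o in triples if s == node]
    return {
        "IP": len(incoming),        # indegree
        "OP": len(outgoing),        # outdegree
        "IDP": len(set(incoming)),  # distinct incoming predicates
        "ODP": len(set(outgoing)),  # distinct outgoing predicates
    }


# Triples mirroring the slide's example graph (edge assignment assumed).
triples = [
    ("s1", "p1", "m"),
    ("s2", "p1", "m"),
    ("s3", "p2", "m"),
    ("m", "p3", "o1"),
    ("m", "p4", "o2"),
]
measures = degree_measures(triples, "m")
```

The distinct variants ignore how many times a predicate is used and only record which kinds of relations touch the node.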
17. Our Specificity Measure
DRR (Distinct Relations Ratio):
$\mathrm{DRR} = \dfrac{\mathrm{IDP}}{\mathrm{ODP}}$, where IDP = Incoming Distinct Predicates and ODP = Outgoing Distinct Predicates
Compared with:
IP/OP, IP+OP, IP, IDP
Computed on the Sindice SPARQL endpoint in less than 1 second.
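As a rough illustration of how DRR might be obtained from a SPARQL endpoint, the sketch below builds two counting queries and divides their results. The query text, the helper names and the handling of a zero denominator are assumptions; the slides do not show the queries actually sent to Sindice, and no network call is made here.

```python
def distinct_predicate_queries(entity_uri):
    """Build SPARQL 1.1 queries counting distinct in/out predicates."""
    incoming = (
        "SELECT (COUNT(DISTINCT ?p) AS ?idp) WHERE { ?s ?p <%s> }" % entity_uri
    )
    outgoing = (
        "SELECT (COUNT(DISTINCT ?p) AS ?odp) WHERE { <%s> ?p ?o }" % entity_uri
    )
    return incoming, outgoing


def drr(idp, odp):
    """Distinct Relations Ratio; undefined (None) without outgoing predicates."""
    if odp == 0:
        return None
    return idp / odp


idp_query, odp_query = distinct_predicate_queries("http://dbpedia.org/resource/RDF")
example = drr(2, 2)  # the node "m" from the graph example: IDP = 2, ODP = 2
```

Two aggregate queries per entity keep the cost low, which fits the requirement of (quasi-)real-time computation.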
18. Alternative SOA Method
DMOZ (Open Directory Project) taxonomy
We use the hierarchical structure of DMOZ as an alternative method to
measure specificity.
We manually map entities to the DMOZ entities and compute the
distance from the root of the DMOZ tree.
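Once an entity has been mapped by hand to a DMOZ category, its distance from the root is just the depth of the category path. A small sketch, assuming paths are written as slash-separated strings starting at "Top" and that the root counts as level 0; the function name and the sample path are illustrative.

```python
def dmoz_level(category_path):
    """Depth of a DMOZ category below the root ("Top" is level 0)."""
    parts = [p for p in category_path.strip("/").split("/") if p]
    if not parts or parts[0] != "Top":
        raise ValueError("expected a path starting at 'Top'")
    return len(parts) - 1


# Hypothetical manual mapping of an entity to a category path.
level = dmoz_level("Top/Arts/Music/Styles")
```

Deeper categories are read as more specific, shallower ones as more generic.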
19. Generation of a Gold Standard
Binary classification of entities
5 humans classified 160 entities into:
– Generic (38%)
– Specific (62%)
Substantial agreement (κ = 0.61)
Ranking of entities
5 humans rated the specificity of 160 entities on a:
– 1 to 10 scale (1=very generic, 10=very specific)
Average Rate: 7.03
Average Std. Dev.: 1.45
Average rate of the 30 entities with the highest Std. Dev.: 5.66
Average rate of the 30 entities with the lowest Std. Dev.: 7.51
Abstract entities are harder for humans to rate
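The slide reports an agreement value of 0.61 without naming the statistic. With five raters per entity, Fleiss' kappa is one common choice, so the sketch below assumes that statistic; the sample counts are made up and do not reproduce the study's data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-item category counts.

    counts: one row per item, one column per category; each row sums to
    the (constant) number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        return 1.0  # every rating fell in one category
    return (p_bar - p_exp) / (1 - p_exp)


# Toy data: columns are (generic, specific), five raters per entity.
toy_counts = [[5, 0], [3, 2], [0, 5], [4, 1]]
kappa = fleiss_kappa(toy_counts)
```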
20. Evaluation: Classification
We compared the different methods against the gold standard
created manually by the users
Agreement with gold std. in the binary classification task:
DMOZ: 83.9%
DRR: 84.1%
IP/OP: 70.0%
IP+OP: 70.0%
IP: 72.5%
random: 61.9%
The performance of the DRR measure for this classification task
is comparable to a manual classification done using the DMOZ
taxonomy and to human judgement.
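Agreement with the gold standard in a binary task is the fraction of entities that receive the same label from the method and from the human annotators. The sketch below assumes a single threshold on a real-valued measure; the threshold, the direction (higher score means generic), the scores and the helper names are all hypothetical and not taken from the study.

```python
def classify_by_threshold(scores, threshold):
    """Turn real-valued scores into generic/specific labels (direction assumed)."""
    return {
        entity: ("generic" if score >= threshold else "specific")
        for entity, score in scores.items()
    }


def agreement(predicted, gold):
    """Fraction of gold-standard entities whose predicted label matches."""
    matches = sum(1 for entity in gold if predicted.get(entity) == gold[entity])
    return matches / len(gold)


# Toy gold labels and toy scores, for illustration only.
gold = {"Music": "generic", "Volleyball": "generic",
        "RDF": "specific", "Mastodon": "specific"}
scores = {"Music": 2.0, "Volleyball": 1.5, "RDF": 0.4, "Mastodon": 1.2}
result = agreement(classify_by_threshold(scores, threshold=1.0), gold)
```

In this toy run three of the four entities match the gold labels, because "Mastodon" falls on the wrong side of the assumed threshold.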
21. Evaluation: Ranking
We rank the specificity of 50 randomly chosen entities using:
Gold standard (average of the 5 users’ rates for each entity)
DMOZ levels (integers, 0 to 9)
– We compute “DMOZ-” and “DMOZ+” as the worst and best possible rankings
compared to the gold standard ranking.
DRR, IP/OP, IP+OP and random values (real numbers)
We compute NDCG (Normalized Discounted Cumulative Gain) at
different ranking positions “p”.
(The ideal DCG is that of the gold standard ranking.)
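NDCG at position p compares the discounted gain of a method's ranking with that of the gold standard ranking. The sketch below uses the common log2(i + 1) discount and takes the average human rating as the gain of each entity; the slides do not state which DCG variant was used, so both choices, along with the toy ratings, are assumptions.

```python
import math


def dcg_at(gains, p):
    """Discounted cumulative gain of the first p gains (log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:p]))


def ndcg_at(ranked_entities, gold_scores, p):
    """NDCG@p of a ranking, with the gold ordering as the ideal."""
    gains = [gold_scores[e] for e in ranked_entities]
    ideal = sorted(gold_scores.values(), reverse=True)
    ideal_dcg = dcg_at(ideal, p)
    return dcg_at(gains, p) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy gold ratings on the 1 to 10 specificity scale.
gold_scores = {"RDF": 9.0, "Mastodon": 8.0, "Volleyball": 4.0, "Music": 1.5}
perfect = ndcg_at(["RDF", "Mastodon", "Volleyball", "Music"], gold_scores, p=3)
reversed_rank = ndcg_at(["Music", "Volleyball", "Mastodon", "RDF"], gold_scores, p=3)
```

A ranking identical to the gold standard scores 1.0 at every p, and any ranking that pushes highly rated entities down the list scores lower.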
22. Evaluation: Ranking
DRR: +5% for NDCG at 10 and 20
23. Evaluation on User Profiles
We evaluate the impact of the proposed measures on user
profiles of interests, a real use case
Interests extracted from users’ posts on Facebook and Twitter
with NLP tools (as described in our previous work [1])
Frequency-based + time decay weighting strategy
27 volunteers
Each user rated his/her generated Top 30 list of interests (794 user
ratings in total)
Ratings on a “1 to 5” scale according to how relevant/interesting
each entity of interest is to the user (5 is highly relevant)
[1] Orlandi et al., I-Semantics 2012
24. Evaluation on User Profiles
Average score (1 to 5 scale) is computed according to groups of types of entities
(Chart of average scores per entity group, annotated with: +8%, 17%, +12%)
Not-popular and generic entities better represent users’ perception of
their interests (but we have only 17% of them)
This behaviour might be different in other applications and use cases!
(e.g. news recommendations, etc.)
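The per-group averages behind this comparison amount to a grouped mean over the user ratings. A minimal sketch with made-up ratings; the group labels and the function name `average_by_group` are illustrative and not taken from the study's data.

```python
from collections import defaultdict


def average_by_group(ratings):
    """Mean user rating per entity group.

    ratings: iterable of (group_label, score) pairs on the 1 to 5 scale.
    """
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for group, score in ratings:
        totals[group][0] += score
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Toy ratings, for illustration only.
ratings = [
    ("generic, not popular", 5),
    ("generic, not popular", 4),
    ("specific, popular", 3),
    ("specific, popular", 2),
]
averages = average_by_group(ratings)
```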
25. Conclusions
Introduced dimensions for characterisation of concepts of interest:
specificity, popularity and temporal dynamics.
Proposed methods for their computation satisfying requirements for
real-time personalisation of Social Web streams:
real-time, domain independent, up to date.
Introduced a novel measure (DRR) for specificity of concepts based
on the LOD cloud
Evaluated for two different tasks (classification and ranking) against SOA
methods (humans, DMOZ, graph measures)
Evaluated the impact of the measures on user profiles of interests
(27 users and ~800 ratings)
Abstract and non-popular interests are preferred by users
26. Future work
Experiment with the measures on user profiles used for different
personalisation tasks.
E.g. a tweet recommender system should give priority to trendy,
popular and specific entities instead.
Improve the simple popularity and trend detection methods.
Improve the DRR measure by adding more “semantics”, i.e. considering
the different types of edges.
27. Thanks!
@badmotorf
fabrizio.orlandi@deri.org
@pavankaps
pavan@knoesis.org
@amit_p
amit@knoesis.org
@terraces
alex@seevl.net