6. Semantic Distance over Linked Data
• Relying only on links
• Relying only on instance data
• Using dereferenceable URIs
• And using resources following the LD
principles
9. G = (R, L, I)
• R = {r_1, r_2, ..., r_n}
• L = {l_1, l_2, ..., l_n}
• I = {i_1, i_2, ..., i_n}
[Figure: example graph linking resources e:r1, e:r2, e:r3 and e:r4 through links e:l1, e:l2 and e:l3]
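The graph definition above can be sketched as a small data structure. This is a minimal sketch, assuming each element of I is a (link, source, target) triple; the edges in `example` are hypothetical and only reuse the e:r* / e:l* names from the figure, and `direct_links` is an illustrative helper, not the LDSD measure itself.

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple

# Assumption: one link instance is (link, source resource, target resource).
LinkInstance = Tuple[str, str, str]


class LinkedDataGraph(NamedTuple):
    resources: FrozenSet[str]            # R
    links: FrozenSet[str]                # L
    instances: FrozenSet[LinkInstance]   # I


def build_graph(instances: Iterable[LinkInstance]) -> LinkedDataGraph:
    """Derive R and L from the link instances in I."""
    inst = frozenset(instances)
    resources = frozenset(r for _, s, t in inst for r in (s, t))
    links = frozenset(link for link, _, _ in inst)
    return LinkedDataGraph(resources, links, inst)


def direct_links(graph: LinkedDataGraph, a: str, b: str) -> int:
    """Count link instances going directly from a to b."""
    return sum(1 for _, s, t in graph.instances if s == a and t == b)


# Hypothetical edges, for illustration only.
example = build_graph([
    ("e:l1", "e:r1", "e:r2"),
    ("e:l2", "e:r1", "e:r3"),
    ("e:l3", "e:r2", "e:r4"),
    ("e:l3", "e:r3", "e:r4"),
])
```

Keeping only I as input and deriving R and L from it avoids the three sets drifting out of sync.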
18. At a glance
• A system providing recommendations for all
DBpedia bands and artists (±40K) using LDSD
• And explaining its recommendations
• Both using Linked Data and Semantic
Web standards (RDF, SPARQL)
• Integrating related Web data for an improved
user-experience
19. Architecture
(1) Dataset identification
(2) Dataset reduction
(3) LDSD computation
(4) User interface
[Diagram labels between stages: RDF Data, RDF Data]
20. Dataset
• Retrieving all artists and bands in DBpedia (±40K)
• Including incoming / outgoing links
• Approximately 3M triples
• Removing datatype properties
• 2.2M (75%)
• Merging /ontology and /property
• 1.7M (55%)
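The two reduction steps above can be sketched as follows. This is a minimal sketch under two assumptions that the slides do not spell out: a datatype property is recognised by its object being a literal, and "merging" means rewriting the /property namespace to /ontology and dropping the resulting duplicates. The `Triple` type and `reduce_dataset` name are hypothetical.

```python
from typing import Iterable, NamedTuple, Set


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    obj_is_literal: bool


PROPERTY_NS = "http://dbpedia.org/property/"
ONTOLOGY_NS = "http://dbpedia.org/ontology/"


def reduce_dataset(triples: Iterable[Triple]) -> Set[Triple]:
    """Drop literal-valued triples, then merge /property into /ontology."""
    reduced = set()
    for t in triples:
        if t.obj_is_literal:
            continue  # assumption: literal object means datatype property
        predicate = t.predicate
        if predicate.startswith(PROPERTY_NS):
            predicate = ONTOLOGY_NS + predicate[len(PROPERTY_NS):]
        # The set removes duplicates created by the namespace merge.
        reduced.add(t._replace(predicate=predicate))
    return reduced
```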
21. Distribution
20K+ artists (50%) are
not linked to any other
artist
22. Curation
• 118 properties linking artists together
• 18 misused, 35 wrongly defined (e.g.
dbprop:klfsgProperty)
• 578 properties linking artists to resources
• 183 used only once, 36 wrongly defined
• 767 properties linking resources to artists
• 336 used only once, 115 wrongly defined
• Dataset reduced to 1M triples
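The automated part of this curation can be sketched as a usage count per property. This is a minimal sketch, assuming that properties used only once are dropped along with a manually maintained list of misused or wrongly defined ones; the slides report the counts but not the exact rule, and `curate`, `blacklist` and `min_uses` are hypothetical names.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List, Tuple

SimpleTriple = Tuple[str, str, str]  # (subject, property, object)


def curate(
    triples: Iterable[SimpleTriple],
    blacklist: FrozenSet[str] = frozenset(),
    min_uses: int = 2,
) -> List[SimpleTriple]:
    """Keep triples whose property is used often enough and not blacklisted."""
    triples = list(triples)  # iterated twice below
    usage = Counter(p for _, p, _ in triples)
    return [
        (s, p, o)
        for s, p, o in triples
        if usage[p] >= min_uses and p not in blacklist
    ]
```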
23. Computing distance
• 9,797 minutes
  (done for all artists in DBpedia)
• 2 x AMD Opteron 250, 4GB, Ubuntu 8.10
• 50M triples
• Modelled using the LDSD ontology
Artist           Time (sec.)
Ramones          25.20
Johnny Cash      61.16
U2               50.06
The Clash        43.34
Bad Religion     34.98
The Aggrolites   7.35
Janis Joplin     23.12
24. Artist Distance
Elvis Presley 0.0978
June Carter Cash 0.1056
Willie Nelson 0.1322
Kris Kristofferson 0.1407
Bob Dylan 0.1466
Marty Robbins 0.1673
Rosanne Cash 0.1782
Charlie McCoy 0.1836
Gene Autry 0.1910
Carl Smith 0.1980
28. Evaluation settings
• Off-line and on-line user evaluation
• Using common RecSys metrics
• 10 subjects
• 2 women, 8 men
• 24 to 34 years old
• 35 to 55 minutes per interview, F2F
29. Metrics
• Off-line evaluation - comparison with last.fm
• 5 artists / bands
• 2 blind lists, 10 ranked recommendations per list
• Marks from 1 to 5
• On-line recommendation - dbrec only
• 5 artists / bands
• Browsing recommendations using dbrec
• Marks from 1 to 5, plus observations and interviews
30. dbrec vs last.fm
• Average mark of recommendations
• 3.37 (±1.19) for dbrec
• 3.44 (±1.25) w/ on-line
• 3.69 (±1.01) for last.fm
31. Results for the precision
(t=X means items are relevant if ranked X or more)
Cannot compute recall (implies users know all bands in the system)
Precision   dbrec (off-line)   dbrec (off+on-line)   last.fm
t=2         92.05              90.59                 98.32
t=3         76.63              77.72                 87.91
t=4         49.06              51.23                 58.05
t=5         20.09              25                    25.165
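The thresholded precision above can be sketched as the share of recommendations whose mark reaches the threshold t. This is a minimal sketch, assuming marks are integers from 1 to 5 as in the evaluation settings; `precision_at_threshold` is a hypothetical name and the marks in the usage note are made up, not taken from the study.

```python
from typing import Sequence


def precision_at_threshold(marks: Sequence[int], t: int) -> float:
    """Percentage of recommendations with a mark of at least t."""
    if not marks:
        raise ValueError("at least one mark is required")
    relevant = sum(1 for m in marks if m >= t)
    return 100.0 * relevant / len(marks)
```

For instance, with the made-up marks `[5, 4, 4, 3, 2, 1, 3, 5, 2, 4]`, five of the ten marks are 4 or more, so the precision at t=4 is 50%.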
32. Novel recommendations
• Lots of unknown recommendations
• 62% for dbrec (59.6% w/ on-line)
• 40.4% for last.fm
• But that’s good news!
• Evaluated 274 of them on dbrec
• 3.05(± 1.09)
33. Observations
• Explanations for unknown bands
• Checked for 198 / 310
• But also for known ones
• 24 / 190
• Helped to understand the recommendation
• Even if they already knew the band
34. Interviews
User-interface Explanations
Enjoyable 9 7
Useful 9 9
Enriching 8 10
Easy to use 10 9
Confusing 0 2
Complicated 0 2
Too geeky 1 6
36. Data quality
• Issues with DBpedia properties
• Misused: dbprop:notableInstruments
• Wrongly defined: dbprop:klfsgProperty
• Duplicates: /ontology versus /property
• Requires data curation!
• Automated and manual
37. Use, but replicate
• More and more public SPARQL endpoints
• Often limited to X max results
• 5,000 on DBpedia
• Difficult to use in production
• Requires local replica
• But implies synchronisation!
But that’s fair enough: hosting a SPARQL endpoint is costly, and opening it up fully to anyone would require lots of maintenance, etc.
38. Use, but replicate
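PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Assumption: the dbpedia: prefix stands for the DBpedia ontology namespace
PREFIX dbpedia: <http://dbpedia.org/ontology/>
SELECT ?label
WHERE {
  ?x rdfs:label ?label .
  { ?x a dbpedia:MusicArtist }
  UNION
  { ?x a dbpedia:Band }
}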
39. Use, but replicate
• Names of all DBpedia artists
• Get number of results w/ COUNT
• Run n/5000 queries (LIMIT + OFFSET)
• Recompose results
• Network errors, etc.
The query had more than 40K results, since most artists have their names in several languages. So many more than 8 queries.
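The paging steps above can be sketched as follows. This is a minimal sketch, assuming the total result count has already been obtained with a COUNT query and that `run_query` is a caller-supplied function that sends one query to an endpoint and returns its rows; both `paged_queries` and `fetch_all` are hypothetical names, and retrying on network errors is left out.

```python
import math
from typing import Callable, Iterator, List


def paged_queries(base_query: str, total: int, page_size: int = 5000) -> Iterator[str]:
    """Yield one LIMIT/OFFSET query per page of results."""
    pages = math.ceil(total / page_size)
    for page in range(pages):
        yield f"{base_query}\nLIMIT {page_size} OFFSET {page * page_size}"


def fetch_all(
    base_query: str,
    total: int,
    run_query: Callable[[str], List[str]],
    page_size: int = 5000,
) -> List[str]:
    """Run every page and recompose the results into one list."""
    rows: List[str] = []
    for query in paged_queries(base_query, total, page_size):
        rows.extend(run_query(query))
    return rows
```

The base query should end with an ORDER BY clause: without a fixed ordering, successive LIMIT/OFFSET pages are not guaranteed to be consistent with each other.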
40. SPARQL, Be quick or be neat
• “List all artists / bands sharing common
property-values with the current one”
• Fits in a single SPARQL query
• But does not scale
• “Optimisation” has to be done manually by
splitting the query and recomposing results
using an external script
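The manual splitting described above can be sketched as one small query per property, with the results merged by a script. This is a minimal sketch of the idea, not the queries actually used in dbrec: the query shape, the `split_by_property` and `recompose` names, and any property URIs passed in are assumptions for illustration.

```python
from typing import Iterable, List, Set


def split_by_property(artist_uri: str, properties: Iterable[str]) -> List[str]:
    """Build one query per property instead of a single query over all of them."""
    return [
        "SELECT DISTINCT ?other WHERE { "
        f"<{artist_uri}> <{prop}> ?value . "
        f"?other <{prop}> ?value . "
        f"FILTER (?other != <{artist_uri}>) }}"
        for prop in properties
    ]


def recompose(result_sets: Iterable[Iterable[str]]) -> Set[str]:
    """Merge the per-query results, removing duplicates."""
    merged: Set[str] = set()
    for rows in result_sets:
        merged.update(rows)
    return merged
```

Each query stays cheap because it binds a single property, at the cost of more round trips and a merge step outside the RDF store.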
41. SPARQL, Be quick or be neat
Tests done in the local RDF store
1: full-query
2: split by property
3: split by property-object
Up to 75% faster
                 Direct SPARQL     Property-slicing   Complete-slicing
                 Queries  Time     Queries  Time      Queries  Time
Ramones          1        139.97   20       109.51    66       37.84
Johnny Cash      1        257.81   30       152.60    135      75.35
U2               1        155.53   22       122.91    70       44.03
The Clash        1        146.43   20       110.84    79       42.61
Bad Religion     1        104.08   23       86.49     97       47.35
The Aggrolites   1        145.92   13       114.52    28       28.33
Janis Joplin     1        230.88   27       151.00    98       62.81
43. Next steps
• Other data sources
• FreeBase, MusicBrainz, etc.
• Distance improvement
• Propagation, feature selection, etc.
• User Interface
• User-friendly explanations
• LOD-compliance
• Mapping with other ontologies, SPARQL endpoint
44. Conclusion
• Defined and applied a Semantic Distance
measure to Linked Data
• Used it to build an end-user music
recommender system, with ±40K artists
• Evaluated it using RecSys metrics
• Learnt several domain-independent lessons
regarding LOD consumption
47. Picture credits
• http://flickr.com/photos/yumlog2/20896759/ by yuki*
• http://richard.cyganiak.de/2007/10/lod/ by Richard Cyganiak and Anja Jentzsch
• http://flickr.com/photos/loungerie/2196866243/ by loungerie
• http://flickr.com/photos/iskanderstruck/248786430/ by iskanderbenamor
• http://flickr.com/photos/homer4k/461407380/ by homer4k
• http://flickr.com/photos/jpellgen/2390204986/ by jpellgen
• http://flickr.com/photos/onegoodbumblebee/839927986/ by One Good Bumblebee
• http://flickr.com/photos/28509009@N03/2668650475/ by marcreis
• http://flickr.com/photos/8049973@N03/2656140464/ by wolf.tone