1. SchemEX – Building an Index
for Linked Open Data
Ansgar Scherp, Thomas Gottron, Mathias Konrath
University of Koblenz-Landau, Germany
Oslo, Norway
August 2012
SchemEX – Building an Index for LOD Slide 1 of 44
2. Learning Goals
• Understand the motivation and fundamentals of Linked Open Data (LOD).
• Understand why an index for LOD is needed and how to create such an index efficiently.
3. Scenario
• Tim plans to travel
– from London
– to a customer in Cologne
4. Website of the German Railway
It works, why bother?
5. Let's Try Different Queries
▪ Bottlenecks in public transportation?
▪ Compare the connections with flights?
▪ Visualize on a map?
▪ …
▪ None of these queries can be answered,
because the data …
6. … locked in Silos!
– High integration effort
– Lack of data reuse
Photo: B. Jagendorf, http://www.flickr.com/photos/bobjagendorf/, CC-BY
7. Linked Data
• Publishing and interlinking of data
• Different quality and purpose
• From different sources in the Web

World Wide Web      Linked Data
--------------      -----------
Documents           Data
Hyperlinks          Typed Links
HTML                RDF
Addresses (URIs)    Addresses (URIs)

Example: http://www.uio.no/
8. Relevance of Linked Data?
9. Linked Data: May '07 → Sept. '11
[Figure: LOD cloud diagram with the domains Media, Geographic, Publications, Web 2.0, eGovernment, Cross-Domain, and Life Sciences]
< 31 Billion Triples. Source: http://lod-cloud.net
10. Linked Data Principles
1. Identification
2. Interlinkage
3. Dereferencing
4. Description
11. Example: Big Lynx
[Figure labels: Matt Briggs, Scott Miller, ?, Big Lynx (company)]
12. 1. Use URIs for Identification
Matt Briggs: http://biglynx.co.uk/people/matt-briggs
Scott Miller: http://biglynx.co.uk/people/scott-miller
Photo: B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY
13. Example: Big Lynx
[Figure labels: Matt Briggs, Scott Miller, Big Lynx (company)]
▪ How to model relationships such as "knows"?
14. Resource Description Framework (RDF)
• Description of resources with RDF triples
Subject       Predicate   Object
Matt Briggs   is a        Person
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://biglynx.co.uk/people/matt-briggs>
    rdf:type foaf:Person .
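A minimal Python sketch of the same statement may help make the triple structure concrete. It treats a triple as a plain (subject, predicate, object) tuple of URIs; the helper `to_ntriples` is a hypothetical name introduced here for illustration and handles URI terms only, not literals or blank nodes.

```python
# Namespace URIs for the rdf: and foaf: prefixes.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"

# One statement = (subject, predicate, object), each identified by a URI.
matt = "http://biglynx.co.uk/people/matt-briggs"
triple = (matt, RDF + "type", FOAF + "Person")


def to_ntriples(t):
    """Serialize a URI-only triple as one N-Triples line (hypothetical helper)."""
    s, p, o = t
    return f"<{s}> <{p}> <{o}> ."


print(to_ntriples(triple))
```

Expanding the prefixes this way shows that "Matt Briggs is a Person" consists of nothing but three URIs.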
15. 1. Use URIs also for Relations
http://biglynx.co.uk/people/matt-briggs
http://biglynx.co.uk/people/scott-miller
Photo: B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY
16. Example: Big Lynx
[Figure: Dave Smith "lives here" London (DBpedia); Matt Briggs at Big Lynx is the "same person" as the Matt Briggs of Matt's private website; also shown: Scott Miller and Big Lynx (company)]
17. 2. Establishing Interlinkage
• Relation links between resources
<http://biglynx.co.uk/people/dave-smith>
    foaf:based_near
    <http://dbpedia.org/resource/London> .
• Identity links between resources
<http://biglynx.co.uk/people/matt-briggs>
    owl:sameAs
    <http://www.matt-briggs.eg.uk#me> .
18. Example: Big Lynx
[Figure: the same graph with vocabulary terms: "lives here" becomes foaf:based_near (Dave Smith to London in DBpedia), "same person" becomes owl:sameAs (Matt Briggs at Big Lynx to Matt's private website)]
19. 3. Dereferencing of URIs
• Looking up web documents
• How can we "look up" things in the real world?
http://biglynx.co.uk/people/matt-briggs
20. Two Approaches
1. Hash URIs
– URI contains a part separated by #, e.g.,
http://biglynx.co.uk/vocab/sme#Team
2. Negotiation via a "303 See Other" redirect
Request: http://biglynx.co.uk/people/matt-briggs
Response: "Look here:"
http://biglynx.co.uk/people/matt-briggs.rdf
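The two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `document_uri` and the `redirects` table are hypothetical, and the table stands in for a server that would answer with "303 See Other"; no HTTP request is made.

```python
from urllib.parse import urldefrag


def document_uri(uri, redirects):
    """Return the URI of the document to fetch for a given resource URI.

    `redirects` is a hypothetical stand-in for a server answering
    "303 See Other"; a real client would issue an HTTP GET instead.
    """
    base, fragment = urldefrag(uri)
    if fragment:
        # Hash URI: the fragment is not sent to the server, so the
        # document is identified by the URI without its '#...' part.
        return base
    # 303 approach: the server points to a document describing the thing.
    return redirects.get(uri, uri)


redirects = {
    "http://biglynx.co.uk/people/matt-briggs":
        "http://biglynx.co.uk/people/matt-briggs.rdf",
}

print(document_uri("http://biglynx.co.uk/vocab/sme#Team", redirects))
print(document_uri("http://biglynx.co.uk/people/matt-briggs", redirects))
```

With a hash URI the client can work out the document location on its own; with the 303 approach it needs the server's answer.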
21. Example: Big Lynx
[Figure: the Big Lynx graph (Dave Smith foaf:based_near London in DBpedia; Matt Briggs owl:sameAs Matt's private website; Big Lynx company) with the open question "Description of Matt?"]
22. 4. Description of URIs
[Figure: RDF graph describing biglynx:matt-briggs. Labels visible in the graph: nodes foaf:Person, dp:Birmingham, dp:London, biglynx:dave-smith, and blank node _:point; edges rdf:type, foaf:based_near (twice), foaf:knows, ex:loc, wgs84:lat, and wgs84:long; literals "51.509" and "-0.118"]
23. RDF / RDF Schema Vocabulary
• Set of URIs defined in the rdf: and rdfs: namespaces

rdf:               rdfs:
----               -----
rdf:type           rdfs:domain
rdf:Property       rdfs:range
rdf:XMLLiteral     rdfs:Resource
rdf:List           rdfs:Literal
rdf:first          rdfs:Datatype
rdf:rest           rdfs:Class
rdf:Seq            rdfs:subClassOf
rdf:Bag            rdfs:subPropertyOf
rdf:Alt            rdfs:comment
rdf:value          rdfs:label
...                …
24. Semantic Web Layer Cake (Simplified)
25. Learning Goals
• Understand the motivation and fundamentals of Linked Open Data (LOD).
• Understand why an index for LOD is needed and how to create such an index efficiently.
26. Scenario
• People who are politicians and actors
• Who else?
• Where do they live?
• Whom do they know? … Whom are they married to?
27. Problem
• No single federated query interface provided
• Execute those queries on the LOD cloud
SELECT ?x
FROM …
WHERE {
  ?x rdf:type ex:Actor .
  ?x rdf:type ex:Politician .
}
Query: "politicians and actors"
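To make the query pattern concrete, here is a small Python sketch that evaluates "typed as both ex:Actor and ex:Politician" over an in-memory list of triples. The sample instances (ex:alice, ex:bob, ex:carol) are made up for illustration, and the function `instances_of_all` is a hypothetical helper, not part of any SPARQL engine.

```python
RDF_TYPE = "rdf:type"

# Hypothetical sample triples; the ex: instances are made up for illustration.
triples = [
    ("ex:alice", RDF_TYPE, "ex:Actor"),
    ("ex:alice", RDF_TYPE, "ex:Politician"),
    ("ex:bob", RDF_TYPE, "ex:Actor"),
    ("ex:carol", RDF_TYPE, "ex:Politician"),
]


def instances_of_all(triples, wanted_types):
    """Return the subjects that are typed with every class in wanted_types."""
    types_by_subject = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types_by_subject.setdefault(s, set()).add(o)
    wanted = set(wanted_types)
    return {s for s, ts in types_by_subject.items() if wanted <= ts}


result = instances_of_all(triples, ["ex:Actor", "ex:Politician"])
print(result)
```

This works only because all triples sit in one list. On the LOD cloud the triples are spread over many sources, which is the problem the slide describes.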
28. Solution in Principle
• Suitable index structure for looking up sources
Query: "politicians and actors"
29. The Naive Approach
1. Download the entire LOD cloud
2. Put it into a (really) large triple store
3. Process the data and extract schema
4. Provide lookup
- Big machinery
- Data is processed late (only after download and loading)
- High effort to scale with LOD cloud
30. Idea
▪ Schema-level index
  • Define families of graph patterns
  • Assign instances to graph patterns
  • Map graph patterns to context (source URI)
▪ Construction
  • Stream-based for scalability
  • Little loss of accuracy
▪ Note
  • Index defined over instances
  • But stores the context
31. Input Data
▪ N-Quads
<subject> <predicate> <object> <context>
▪ Example:
<http://www.w3.org/People/Connolly/#me>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#
<http://xmlns.com/foaf/0.1/Person>
<http://dig.csail.mit.edu/2008/webdav/timbl/
[Figure: w3p:#me is a foaf:Person, stated in context http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf]
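As a rough illustration of the input format, the following Python sketch splits one N-Quads line into subject, predicate, object, and context. It is not the NxParser used in the talk: the regular expression only accepts quads whose four terms are URIs. The example line is an assumed completion of the quad on the slide, with the predicate taken to be rdf:type and the context taken from the figure labels.

```python
import re

# Matches a quad whose four terms are all URIs in angle brackets.
# Literals and blank nodes are deliberately not handled in this sketch.
QUAD = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def parse_quad(line):
    """Return (subject, predicate, object, context), or None if no match."""
    m = QUAD.match(line)
    return m.groups() if m else None


# Assumed full form of the slide's example (predicate and context completed).
line = (
    "<http://www.w3.org/People/Connolly/#me> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> ."
)
s, p, o, c = parse_quad(line)
print(c)
```

The fourth term is what a schema-level index stores: it names the source in which the statement was found.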
32. Building the Schema and Index
[Figure: layered index structure, top to bottom]
RDF classes:          C1, C2, C3, …, Ck
    consistsOf
Type clusters:        TC1, TC2, …, TCm
    hasEQClass (via properties p1, p2)
Equivalence classes:  EQC1, EQC2, …, EQCn
    hasDataSource
Data sources:         DS1, DS2, DS3, DS4, DS5, …, DSx
33. Layer 1: RDF Classes
▪ All instances of a particular type
SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
}
[Figure: class C1 (foaf:Person) points to data sources DS 1, DS 2, DS 3. Example: timbl:card#i is a foaf:Person in http://www.w3.org/People/Berners-Lee/card; a second source label reads http://dig.csail.mit.edu/2008/...]
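A possible in-memory shape for this first layer is a map from each RDF class to the data sources (contexts) that contain an instance of it. The Python sketch below is an assumption about how such a lookup could be built, not the authors' implementation. The second quad and its source URI are invented for the example.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_class_index(quads):
    """Map each RDF class to the set of contexts holding an instance of it."""
    index = defaultdict(set)
    for s, p, o, c in quads:
        if p == RDF_TYPE:
            index[o].add(c)
    return index


quads = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person",
     "http://www.w3.org/People/Berners-Lee/card"),
    # Made-up second source, for illustration only.
    ("ex:someone", RDF_TYPE, "foaf:Person", "http://example.org/ds2"),
]
index = build_class_index(quads)
print(sorted(index["foaf:Person"]))
```

A lookup for foaf:Person then returns sources to query, not the instances themselves.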
34. Layer 2: Type Clusters
▪ All instances belonging to exactly the same set of types
SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
}
[Figure: type cluster TC1 combines classes C1 and C2 and points to data sources DS 1, DS 2, DS 3. Example: tc4711 = {foaf:Person, pim:Male} contains timbl:card#i from http://www.w3.org/People/Berners-Lee/card]
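Type clusters can be sketched as a grouping of instances by their exact set of types. The Python below is a non-streaming illustration under the assumption that all quads are available at once; `build_type_clusters` is a hypothetical name, and the ex:bob quad is invented.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def build_type_clusters(quads):
    """Map each exact type set (a frozenset) to the contexts of its instances."""
    types = defaultdict(set)     # instance -> set of classes
    contexts = defaultdict(set)  # instance -> contexts it was seen in
    for s, p, o, c in quads:
        if p == RDF_TYPE:
            types[s].add(o)
            contexts[s].add(c)
    clusters = defaultdict(set)
    for inst, type_set in types.items():
        clusters[frozenset(type_set)] |= contexts[inst]
    return clusters


card = "http://www.w3.org/People/Berners-Lee/card"
quads = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person", card),
    ("timbl:card#i", RDF_TYPE, "pim:Male", card),
    ("ex:bob", RDF_TYPE, "foaf:Person", "http://example.org/ds2"),  # made-up
]
clusters = build_type_clusters(quads)
print(clusters[frozenset({"foaf:Person", "pim:Male"})])
```

Because the key is the exact type set, ex:bob (only foaf:Person) falls into a different cluster than timbl:card#i.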
35. Layer 3: Equivalence Classes
▪ Two instances are equivalent iff:
  • They are in the same TC
  • They have the same properties
  • The property targets are in the same TC
▪ Similar to 1-Bisimulation
[Figure: classes C1, C2, C3; type clusters TC1 and TC2 linked by property p; equivalence class EQC1; data sources DS 1, DS 2, DS 3]
36. Layer 3: Equivalence Classes
SELECT ?x
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
  ?x foaf:maker ?y .
  ?y rdf:type foaf:PersonalProfileDocument .
}
[Figure: tc4711 = {foaf:Person, pim:Male} and tc1234 = {foaf:PPD}, where foaf:PPD abbreviates foaf:PersonalProfileDocument; eqc0815 = tc4711 -maker- tc1234. Example: timbl:card#i foaf:maker timbl:card, from http://www.w3.org/People/Berners-Lee/card]
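The equivalence criterion can be illustrated with a short Python sketch: an instance is keyed by its own type cluster plus the set of (property, type cluster of the target) pairs. This is one reading of the three conditions under the assumption that all triples fit in memory; the instances ex:bob, ex:bobcard, and ex:eve are made up, and `equivalence_classes` is a hypothetical helper.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def equivalence_classes(triples):
    """Group instances by (own type cluster, {(property, target type cluster)})."""
    types = defaultdict(set)  # instance -> its classes
    links = defaultdict(set)  # instance -> {(property, target)}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
        else:
            links[s].add((p, o))

    def type_cluster(node):
        return frozenset(types.get(node, ()))

    classes = defaultdict(set)
    for inst in set(types) | set(links):
        signature = frozenset((p, type_cluster(o)) for p, o in links.get(inst, ()))
        classes[(type_cluster(inst), signature)].add(inst)
    return classes


triples = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person"),
    ("timbl:card#i", RDF_TYPE, "pim:Male"),
    ("timbl:card#i", "foaf:maker", "timbl:card"),
    ("timbl:card", RDF_TYPE, "foaf:PersonalProfileDocument"),
    # Made-up instances for comparison.
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("ex:bob", RDF_TYPE, "pim:Male"),
    ("ex:bob", "foaf:maker", "ex:bobcard"),
    ("ex:bobcard", RDF_TYPE, "foaf:PersonalProfileDocument"),
    ("ex:eve", RDF_TYPE, "foaf:Person"),
    ("ex:eve", RDF_TYPE, "pim:Male"),
]
classes = equivalence_classes(triples)
```

Here timbl:card#i and ex:bob end up in the same class, while ex:eve shares their type cluster but has no foaf:maker link and so forms a class of its own.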
37. The SchemEX Approach
• Stream-based schema extraction
• While crawling the data
[Figure: architecture. Input from LOD-Crawler or RDF-Dump; NxParser (parser) produces an Nquad-Stream; Schema-Extractor with a FIFO Instance-Cache builds the Schema-Level Index; storage in an RDF Triple Store or RDBMS]
38. Building the Index from a Stream
▪ Stream of N-Quads (coming from an LD crawler)
… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1
[Figure: the quads enter a FIFO cache holding instances 1 to 6 and their classes C1, C2, C3]
• Linear runtime complexity w.r.t. the number of input triples
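The stream-based construction might look roughly like the following Python sketch, restricted to type clusters. It is an illustration under assumptions, not the SchemEX implementation: the eviction policy (drop the instance that entered the cache first and write its current type set to the index) and the function name `stream_type_clusters` are choices made here. Each quad is handled once with constant-time dictionary operations, which matches the linear runtime stated above.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "rdf:type"


def stream_type_clusters(quads, cache_size):
    """Build a type-cluster -> contexts index with a bounded FIFO instance cache."""
    cache = OrderedDict()     # instance -> (types, contexts), oldest first
    index = defaultdict(set)  # type cluster -> contexts

    def flush(types, contexts):
        index[frozenset(types)] |= contexts

    for s, p, o, c in quads:
        if p != RDF_TYPE:
            continue
        if s not in cache:
            if len(cache) >= cache_size:
                # Cache full: evict the oldest instance and index it as seen so far.
                _, (types, contexts) = cache.popitem(last=False)
                flush(types, contexts)
            cache[s] = (set(), set())
        cache[s][0].add(o)
        cache[s][1].add(c)
    # End of stream: index whatever is still cached.
    for types, contexts in cache.values():
        flush(types, contexts)
    return index


quads = [
    ("a", RDF_TYPE, "C1", "ds1"),
    ("b", RDF_TYPE, "C2", "ds2"),
    ("a", RDF_TYPE, "C2", "ds1"),
]
full = stream_type_clusters(quads, cache_size=2)
small = stream_type_clusters(quads, cache_size=1)
```

With `cache_size=2` instance `a` is indexed under {C1, C2}. With `cache_size=1` it is evicted before its second type arrives and ends up under {C1} and {C2} separately, a small example of the accuracy loss that a larger cache reduces.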
39. Computing SchemEX: TimBL Data Set
• Analysis of a smaller data set
• 11 M triples, TimBL's FOAF profile
• LDspider with ~2k triples/sec
• Different cache sizes: 100, 1k, 10k, 50k, 100k
• Compared SchemEX with reference schema
• Index queries on all Types, TCs, EQCs
• Good precision and recall at cache sizes of 50k and above
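For one index query, precision and recall against the reference schema could be computed as in the Python sketch below. The data source names are invented, and treating an empty result as a score of 1.0 is a convention assumed here, not something stated in the talk.

```python
def precision_recall(retrieved, relevant):
    """Compare sources returned by the stream-built index with the reference."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumed convention: an empty set counts as a perfect score.
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Made-up example: sources returned for one query vs. the reference answer.
precision, recall = precision_recall(
    {"ds1", "ds2", "ds3"},
    {"ds1", "ds2", "ds4", "ds5"},
)
print(precision, recall)
```

Averaging these two values over all type, TC, and EQC queries would give one pair of scores per cache size.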
40. Quality of Stream-based Index Construction
• Runtime hardly increases with window size
• Memory consumption scales with window size
41. Computing SchemEX: Full BTC 2011 Data
Cache size: 50 k
43. Conclusions: SchemEX
• Linked Open Data (LOD) approach
– Publishing and interlinking data on the web
• SchemEX
– Stream-based approach to LOD schema extraction
– Scalable to arbitrary amounts of Linked Data
– Applicable on commodity hardware (4GB RAM, single CPU)
44. Learning Goals
• Understand the motivation and fundamentals of Linked Open Data (LOD).
• Understand why an index for LOD is needed and how to create such an index efficiently.
45. Recommended Readings
• Maciej Janik, Ansgar Scherp, Steffen Staab: The Semantic Web: Collective Intelligence on the Web. Informatik Spektrum 34(5): 469-483 (2011)
URL: http://dx.doi.org/10.1007/s00287-011-0535-x
• Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp: SchemEX – Efficient construction of a data catalogue by stream-based indexing of linked data. J. of Web Semantics: Science, Services and Agents on the World Wide Web, available online 23 June 2012
URL: http://www.sciencedirect.com/science/article/pii/S1570826812000716
• Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool Publishers, 2011
URL: http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001