ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization
Blerina Spahiu, Riccardo Porrini, Matteo Palmonari, Anisa Rula, Andrea Maurino
University of Milano-Bicocca (surname@disco.unimib.it)
blerina.spahiu@disco.unimib.it
Outline
• Motivation
  - Dataset Understanding
  - State of the Art
• Summarization Framework
  - Abstract Knowledge Patterns (AKPs)
  - Pattern Minimalization
  - Summary extraction, storage and presentation
• Evaluation
  - Compactness
  - Informativeness
  - User Study
• Conclusion and Future Work
Introduction
• What types of resources are there in a data set?
• How are they described?
• What types of resources are linked by a certain property and how frequently?
Motivation
• Understanding the content of data sets is challenging
• Looking at the ontology is not enough:
  - Ontologies may be large and underspecified
    (DBpedia 2015-04: 2795 properties, domain not specified for 259 properties, range not specified for 187 properties)
  - No information about the usage
• Explorative queries are too expensive
  - Significant server overload
  - High response time/timeout
State of the Art
• Relevance Based Summarization
  - Identifying subsets of data sets or ontologies that are considered to be more relevant
  - Troullinou et al. 2015; Zhang et al. 2007
• Pattern Based Approaches
  - Aim at extracting knowledge patterns for a complete representation of the data set
  - Mihindukulasooriya et al. 2015; Presutti et al. 2011; M. Jarrar and M. Dikaiakos, 2012
• Schema Induction
  - Induces a schema from the data and aims at extracting stronger assertions
  - Völker and Niepert, 2011
• Statistics about the dataset
  - Aim at reporting statistics about the usage of different vocabularies, properties and types in the data
  - Konrath et al. 2012; Langegger and W. Wöß, 2009; Auer et al. 2012; Linked Open Vocabularies (http://lov.okfn.org/)
ABSTAT
• ABSTAT (http://abstat.disco.unimib.it) is an ontology-driven linked data summarization framework
• A summary provides a complete but compact schema-level representation of a data set:
  - A set of Abstract Knowledge Patterns (AKPs): an AKP represents, for example, the fact that there are instances of type Person linked with instances of type Settlement by the property birthplace
  - Statistics: how many times a pattern occurs in the data set, how many times a certain type occurs as minimal type, and how many times a property occurs in the data set
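The three kinds of statistics named on this slide can be sketched with plain counters. This is only an illustration, not ABSTAT's implementation: the instances, the `minimal_type` mapping and the variable names are hypothetical, and each instance is assumed to have exactly one minimal type.

```python
from collections import Counter

# Hypothetical toy data: each instance is already mapped to one minimal type.
minimal_type = {
    "Dante": "Person",
    "Galileo": "Person",
    "Florence": "Settlement",
    "Pisa": "Settlement",
}

# Relational assertions between instances: (subject, property, object).
triples = [
    ("Dante", "birthplace", "Florence"),
    ("Galileo", "birthplace", "Pisa"),
]

# How many times does each pattern (C, P, D) occur in the data set?
akp_frequency = Counter(
    (minimal_type[s], p, minimal_type[o]) for s, p, o in triples
)

# How many times does each type occur as a minimal type?
type_frequency = Counter(minimal_type.values())

# How many times does each property occur?
property_frequency = Counter(p for _, p, _ in triples)

print(akp_frequency[("Person", "birthplace", "Settlement")])  # 2
```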
Abstract Knowledge Patterns (AKPs)
• ABSTAT adopts a minimalization mechanism based on minimal type patterns
• Minimalization is based on a subtype graph, which represents the data ontology
• Abstract Knowledge Patterns (AKPs) are abstract representations of Knowledge Patterns
• An AKP is a triple (C, P, D) such that C and D are types and P is a property
• ABSTAT represents only a subset of the AKPs occurring in the data set: those whose types are minimal types
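A minimal sketch of minimal-type pattern extraction, assuming the subtype graph is a small acyclic dictionary and that a type is minimal for an instance when no other asserted type of that instance is one of its subtypes. The subclass edges, the type assertions and all function names are assumptions made for illustration; they are loosely modelled on the Person/Artist/Lawyer example of this deck and are not taken from the ABSTAT code base.

```python
# Subtype graph: type -> set of direct supertypes (assumed toy ontology).
SUBCLASS_OF = {
    "Sportist": {"Person"},
    "FootballPlayer": {"Sportist"},
    "Artist": {"Person"},
    "Lawyer": {"Person"},
}


def ancestors(t):
    """Return all strict supertypes of t in the subtype graph."""
    seen, stack = set(), list(SUBCLASS_OF.get(t, ()))
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(SUBCLASS_OF.get(s, ()))
    return seen


def minimal_types(types):
    """Drop every asserted type that is a supertype of another asserted type."""
    types = set(types)
    covered = set().union(*(ancestors(t) for t in types))
    return types - covered


def minimal_type_patterns(triples, types_of):
    """Build the set of (C, P, D) patterns over minimal types only."""
    patterns = set()
    for s, p, o in triples:
        for c in minimal_types(types_of[s]):
            for d in minimal_types(types_of[o]):
                patterns.add((c, p, d))
    return patterns


# Hypothetical type assertions and relational assertions.
types_of = {
    "Jim Brown": {"Person", "FootballPlayer", "Artist"},
    "George Clooney": {"Person", "Artist"},
    "Amal Clooney": {"Person", "Lawyer"},
    '"1936-02-17"': {"XMLSchema#Date"},
}
triples = [
    ("George Clooney", "hasWife", "Amal Clooney"),
    ("Jim Brown", "birthDate", '"1936-02-17"'),
]

for pattern in sorted(minimal_type_patterns(triples, types_of)):
    print(pattern)
```

Under these assumed assertions, Person never appears in a pattern because every instance also has a more specific type.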
An example of how AKPs are extracted
[Figure: example graph. Types: Person, Sportist, FootballPlayer, Lawyer, Artist, XMLSchema#Date; instances: Jim Brown, George Clooney, Amal Clooney; literal: "1936-02-17"; edges: subclassOf between types, type from instances (and the literal) to types, and the properties hasWife and birthDate.]
The (minimal-type) patterns extracted by ABSTAT are:
<Artist, hasWife, Lawyer>
<FootballPlayer, birthDate, XMLSchema#Date>
<Artist, birthDate, XMLSchema#Date>
Redundant patterns excluded by the summary:
<Person, hasWife, Person>
<Sportist, birthDate, XMLSchema#Date>
<Person, birthDate, XMLSchema#Date>
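Why the excluded patterns are redundant can be checked mechanically: a pattern adds nothing if some minimal-type pattern with the same property has a subject type and an object type that are equal to, or subtypes of, the pattern's own. The sketch below assumes the subclassOf edges of the figure are Sportist ⊑ Person, FootballPlayer ⊑ Sportist, Artist ⊑ Person and Lawyer ⊑ Person (the edge directions are not recoverable from the slide text), and the function names are hypothetical.

```python
# Assumed subtype graph: type -> set of direct supertypes (acyclic).
SUBCLASS_OF = {
    "Sportist": {"Person"},
    "FootballPlayer": {"Sportist"},
    "Artist": {"Person"},
    "Lawyer": {"Person"},
}

# Minimal-type patterns listed on the slide.
MINIMAL_PATTERNS = {
    ("Artist", "hasWife", "Lawyer"),
    ("FootballPlayer", "birthDate", "XMLSchema#Date"),
    ("Artist", "birthDate", "XMLSchema#Date"),
}


def is_subtype_or_equal(sub, sup):
    """True if sub equals sup or reaches sup by following subclassOf edges."""
    if sub == sup:
        return True
    return any(
        is_subtype_or_equal(parent, sup) for parent in SUBCLASS_OF.get(sub, ())
    )


def is_inferable(pattern):
    """True if the pattern follows from a minimal pattern and the subtype graph."""
    c, p, d = pattern
    return any(
        p == mp and is_subtype_or_equal(mc, c) and is_subtype_or_equal(md, d)
        for mc, mp, md in MINIMAL_PATTERNS
    )


redundant = [
    ("Person", "hasWife", "Person"),
    ("Sportist", "birthDate", "XMLSchema#Date"),
    ("Person", "birthDate", "XMLSchema#Date"),
]
print(all(is_inferable(r) for r in redundant))  # True
print(is_inferable(("Lawyer", "hasWife", "Artist")))  # False
```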
Summary Extraction Workflow
ABSTAT User Interfaces
• ABSTAT homepage (http://abstat.disco.unimib.it)
• ABSTATBrowse (http://abstat.disco.unimib.it/browse)
• ABSTATSearch (http://abstat.disco.unimib.it/search)
• SPARQL Endpoint (http://abstat.disco.unimib.it/sparql)
Experimental Evaluation
• Summary compactness
  - Number of patterns in the summary vs. number of triples in the data set
  - Comparison with a similar approach without minimalization
• Summary informativeness
  - Insights about the semantics of the properties
  - Small-scale user study
Compactness
Data sets and summaries statistics
Dataset               Relational   Typing   Assertions   Types (Ext.)   Properties (Ext.)   Patterns
DBpedia Core 2014     40.5M        29.7M    70.1M        869 (85)       1439 (15)           171340
DBpedia 3.9 Infobox   96.3M        19.7M    116.4M       821 (58)       62572 (14)          732418
Linked Brainz         180.1M       39.6M    221.7M       21 (9)         33 (0)              161
Reduction Rate = Number of patterns / Number of assertions in the data set
Reduction rate (LOUPE is similar to ABSTAT without minimalization)
Dataset               ABSTAT         LOUPE
DBpedia Core 2014     0.002          0.01
Linked Brainz         6.72 × 10^-7   7.1 × 10^-7
• Minimalization produces more compact summaries
• The advantage of minimalization is more observable for datasets with richer subtype graphs and typing assertions
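As a worked instance of the reduction rate (number of patterns divided by number of assertions), the DBpedia Core 2014 row can be recomputed from the table values. The function name is hypothetical, and the assertion count is the rounded 70.1M figure from the slide, so the result is approximate.

```python
def reduction_rate(num_patterns: int, num_assertions: float) -> float:
    """Ratio between patterns in the summary and assertions in the data set."""
    return num_patterns / num_assertions


# DBpedia Core 2014: 171340 patterns over roughly 70.1M assertions.
dbpedia_core = reduction_rate(171_340, 70.1e6)
print(f"{dbpedia_core:.3f}")  # 0.002
```

The lower the rate, the more compact the summary is relative to the data set it describes.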
Informativeness
ABSTAT summaries provide useful insights about the semantics of properties, based on their usage within a data set
Dataset               Missing Domain (%)   Missing Range (%)   Missing Domain & Range (%)
DBpedia Core 2014     259 (18%)            187 (13%)           48 (3.3%)
DBpedia 3.9 Infobox   61368 (98%)          61309 (98%)         61161 (97%)
Linked Brainz         13 (39%)             15 (45%)            13 (39%)
[Figure: Inferred domain and range for DBpedia Core 2014. Bar chart; x-axis: DBpedia ontology properties (dbo:type, dbo:religion, dbo:division, dbo:picture, dbo:isPartOf and others); y-axis: number of minimal types (0 to 160); series: extracted minimal types (domain) and extracted minimal types (range).]
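One way to read such usage-based insights out of a summary is to group the patterns by property and collect the subject types as an inferred domain and the object types as an inferred range. The sketch below assumes the summary is available as a list of (C, P, D) tuples; the pattern list and the function name are hypothetical and this is not the ABSTAT API.

```python
from collections import defaultdict

# Hypothetical summary: a handful of minimal-type patterns (C, P, D).
patterns = [
    ("Artist", "hasWife", "Lawyer"),
    ("FootballPlayer", "birthDate", "XMLSchema#Date"),
    ("Artist", "birthDate", "XMLSchema#Date"),
]


def usage_profile(patterns):
    """Collect, per property, the minimal types seen as subject and as object."""
    domain, range_ = defaultdict(set), defaultdict(set)
    for c, p, d in patterns:
        domain[p].add(c)
        range_[p].add(d)
    return domain, range_


domain, range_ = usage_profile(patterns)
for prop in sorted(domain):
    # Counts of distinct minimal types per property, as plotted in the chart.
    print(prop, len(domain[prop]), len(range_[prop]))
```

For a property whose domain or range is missing in the ontology, these sets show how the property is actually used in the data.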
User Study: Setup
Can ABSTAT be useful to support query formulation?
• Queries to DBpedia 3.9 Infobox from the Question Answering over Linked Data (QALD) benchmark
  - 5 queries of increasing length (1 of length 1, 2 of length 2 and 2 of length 3)
• 20 participants, 2 groups:
  - abstat group uses ABSTAT (after 20 min of training)
  - control group does not use ABSTAT
• Measures:
  - Time needed to formulate the query
  - Accuracy of the answer
User Study: Questionnaire
User Study: Results
Query / Group   Avg. Completion Time (s)   Accuracy
Query 1 (length 1): How many employees does Google have?
  abstat        358.9                      0.9
  control       380.6                      0.8
Query 2 (length 2): Give me all people that were born in Vienna and died in Berlin.
  abstat        356.3                      1
  control       346.9                      0.8
Query 3 (length 2): Which professional surfers were born in Australia?
  abstat        476.6                      0.6
  control       234.24                     0.7
Query 4 (length 3): In which films directed by Garry Marshall was Julia Roberts starring?
  abstat        333.4                      0.9
  control       445.6                      0.9
Query 5 (length 3): Give me all books by William Goldman with more than 300 pages.
  abstat        233.4                      1
  control       569.8                      0.7
An independent t-test showed a significant difference between the two groups in answering Q5 correctly: t(16) = 10.32, p < .005
User Study: Results Analysis
• abstat group users benefit from the ABSTAT summary in terms of average completion time, accuracy, or both
  - Accuracy increases with query difficulty, and the tasks are performed faster
  - The exception is query 3, because the individual Surfing is classified with no type other than owl:Thing
• Participants from the control group used two strategies to answer the queries:
  - Directly accessing the public web pages describing the DBpedia named individuals mentioned in the query
  - Submitting explorative SPARQL queries to the endpoint (very few participants)
Conclusion and Future Work
• ABSTAT: ontology-driven summarization with minimalization
• Considerable reduction rate and promising results about the informativeness of the summary
• Currently extending the user study
• Apply relevance-oriented summarization methods based on connectivity analysis
• ABSTAT summaries should consider the inheritance of properties to produce even more compact summaries
• We envision a complete analysis of the most important data sets available in the LOD cloud (20+ data sets available)
• APIs available soon
Thank you for your attention!
www.abstat.unimib.it
University of Milan - Bicocca 24

Weitere ähnliche Inhalte

Was ist angesagt?

Weblog Extraction With Fuzzy Classification Methods
Weblog Extraction With Fuzzy Classification MethodsWeblog Extraction With Fuzzy Classification Methods
Weblog Extraction With Fuzzy Classification MethodsEdy Portmann
 
Multidimensioal database
Multidimensioal  databaseMultidimensioal  database
Multidimensioal databaseTPO TPO
 
Ontology integration - Heterogeneity, Techniques and more
Ontology integration - Heterogeneity, Techniques and moreOntology integration - Heterogeneity, Techniques and more
Ontology integration - Heterogeneity, Techniques and moreAdriel Café
 
Ontology For Data Integration
Ontology For Data IntegrationOntology For Data Integration
Ontology For Data Integrationjuanesteva
 
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...Khirulnizam Abd Rahman
 
Higher Order Learning
Higher Order LearningHigher Order Learning
Higher Order Learningbutest
 
Towards a Query Rewriting Algorithm Over Proteomics XML Resources
Towards a Query Rewriting Algorithm Over Proteomics XML ResourcesTowards a Query Rewriting Algorithm Over Proteomics XML Resources
Towards a Query Rewriting Algorithm Over Proteomics XML ResourcesCSCJournals
 
Hyponymy extraction of domain ontology
Hyponymy extraction of domain ontologyHyponymy extraction of domain ontology
Hyponymy extraction of domain ontologyIJwest
 
Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Rakebul Hasan
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text miningIRJET Journal
 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHIJDKP
 
A new keyphrases extraction method based on suffix tree data structure for ar...
A new keyphrases extraction method based on suffix tree data structure for ar...A new keyphrases extraction method based on suffix tree data structure for ar...
A new keyphrases extraction method based on suffix tree data structure for ar...ijma
 
NE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISNE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISrathnaarul
 
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge DiscoveryA Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge DiscoveryIJERA Editor
 

Was ist angesagt? (19)

Weblog Extraction With Fuzzy Classification Methods
Weblog Extraction With Fuzzy Classification MethodsWeblog Extraction With Fuzzy Classification Methods
Weblog Extraction With Fuzzy Classification Methods
 
Multidimensioal database
Multidimensioal  databaseMultidimensioal  database
Multidimensioal database
 
Ontology integration - Heterogeneity, Techniques and more
Ontology integration - Heterogeneity, Techniques and moreOntology integration - Heterogeneity, Techniques and more
Ontology integration - Heterogeneity, Techniques and more
 
Text mining
Text miningText mining
Text mining
 
Ontology For Data Integration
Ontology For Data IntegrationOntology For Data Integration
Ontology For Data Integration
 
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
 
Higher Order Learning
Higher Order LearningHigher Order Learning
Higher Order Learning
 
E43022023
E43022023E43022023
E43022023
 
Towards a Query Rewriting Algorithm Over Proteomics XML Resources
Towards a Query Rewriting Algorithm Over Proteomics XML ResourcesTowards a Query Rewriting Algorithm Over Proteomics XML Resources
Towards a Query Rewriting Algorithm Over Proteomics XML Resources
 
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERINGAN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
 
Ontology
OntologyOntology
Ontology
 
Hyponymy extraction of domain ontology
Hyponymy extraction of domain ontologyHyponymy extraction of domain ontology
Hyponymy extraction of domain ontology
 
Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text mining
 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
 
A new keyphrases extraction method based on suffix tree data structure for ar...
A new keyphrases extraction method based on suffix tree data structure for ar...A new keyphrases extraction method based on suffix tree data structure for ar...
A new keyphrases extraction method based on suffix tree data structure for ar...
 
NE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISNE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSIS
 
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge DiscoveryA Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
 

Andere mochten auch

Modular Ontologies - A Formal Investigation of Semantics and Expressivity
Modular Ontologies - A Formal Investigation of Semantics and ExpressivityModular Ontologies - A Formal Investigation of Semantics and Expressivity
Modular Ontologies - A Formal Investigation of Semantics and ExpressivityJie Bao
 
The General Ontology Evaluation Framework (GOEF) & the I-Choose Use Case A ...
The General Ontology Evaluation Framework (GOEF) & the I-Choose Use CaseA ...The General Ontology Evaluation Framework (GOEF) & the I-Choose Use CaseA ...
The General Ontology Evaluation Framework (GOEF) & the I-Choose Use Case A ...Joanne Luciano
 
OWL Web Ontology Language Overview
OWL Web Ontology Language OverviewOWL Web Ontology Language Overview
OWL Web Ontology Language OverviewIgor Myroshnichenko
 
IAS 16 Ontology Dojo
IAS 16 Ontology DojoIAS 16 Ontology Dojo
IAS 16 Ontology DojoRen Pope
 
Schema.org: Where did that come from!
Schema.org: Where did that come from!Schema.org: Where did that come from!
Schema.org: Where did that come from!Richard Wallis
 
On Beyond OWL: challenges for ontologies on the Web
On Beyond OWL: challenges for ontologies on the WebOn Beyond OWL: challenges for ontologies on the Web
On Beyond OWL: challenges for ontologies on the WebJames Hendler
 
Semantic Web - Ontology 101
Semantic Web - Ontology 101Semantic Web - Ontology 101
Semantic Web - Ontology 101Luigi De Russis
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosEUCLID project
 
DBpedia Tutorial - Feb 2015, Dublin
DBpedia Tutorial - Feb 2015, DublinDBpedia Tutorial - Feb 2015, Dublin
DBpedia Tutorial - Feb 2015, Dublinm_ackermann
 

Andere mochten auch (12)

Deliverable_5.1.2
Deliverable_5.1.2Deliverable_5.1.2
Deliverable_5.1.2
 
Modular Ontologies - A Formal Investigation of Semantics and Expressivity
Modular Ontologies - A Formal Investigation of Semantics and ExpressivityModular Ontologies - A Formal Investigation of Semantics and Expressivity
Modular Ontologies - A Formal Investigation of Semantics and Expressivity
 
The General Ontology Evaluation Framework (GOEF) & the I-Choose Use Case A ...
The General Ontology Evaluation Framework (GOEF) & the I-Choose Use CaseA ...The General Ontology Evaluation Framework (GOEF) & the I-Choose Use CaseA ...
The General Ontology Evaluation Framework (GOEF) & the I-Choose Use Case A ...
 
OWL Web Ontology Language Overview
OWL Web Ontology Language OverviewOWL Web Ontology Language Overview
OWL Web Ontology Language Overview
 
IAS 16 Ontology Dojo
IAS 16 Ontology DojoIAS 16 Ontology Dojo
IAS 16 Ontology Dojo
 
Schema.org: Where did that come from!
Schema.org: Where did that come from!Schema.org: Where did that come from!
Schema.org: Where did that come from!
 
On Beyond OWL: challenges for ontologies on the Web
On Beyond OWL: challenges for ontologies on the WebOn Beyond OWL: challenges for ontologies on the Web
On Beyond OWL: challenges for ontologies on the Web
 
Wither OWL
Wither OWLWither OWL
Wither OWL
 
Semantic Web - Ontology 101
Semantic Web - Ontology 101Semantic Web - Ontology 101
Semantic Web - Ontology 101
 
2016 bmdid-mappings
2016 bmdid-mappings2016 bmdid-mappings
2016 bmdid-mappings
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
DBpedia Tutorial - Feb 2015, Dublin
DBpedia Tutorial - Feb 2015, DublinDBpedia Tutorial - Feb 2015, Dublin
DBpedia Tutorial - Feb 2015, Dublin
 

Ähnlich wie ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization

LinkSUM: Using Link Analysis to Summarize Entity Data
LinkSUM: Using Link Analysis to Summarize Entity DataLinkSUM: Using Link Analysis to Summarize Entity Data
LinkSUM: Using Link Analysis to Summarize Entity DataAndreas Thalhammer
 
Profiling Linked Open Data
Profiling Linked Open DataProfiling Linked Open Data
Profiling Linked Open DataBlerina Spahiu
 
Managing textual data semantically in relational databases by wael yahfooz an...
Managing textual data semantically in relational databases by wael yahfooz an...Managing textual data semantically in relational databases by wael yahfooz an...
Managing textual data semantically in relational databases by wael yahfooz an...SK Ahammad Fahad
 
Driving Deep Semantics in Middleware and Networks: What, why and how?
Driving Deep Semantics in Middleware and Networks: What, why and how?Driving Deep Semantics in Middleware and Networks: What, why and how?
Driving Deep Semantics in Middleware and Networks: What, why and how?Amit Sheth
 
A Topic map-based ontology IR system versus Clustering-based IR System: A Com...
A Topic map-based ontology IR system versus Clustering-based IR System: A Com...A Topic map-based ontology IR system versus Clustering-based IR System: A Com...
A Topic map-based ontology IR system versus Clustering-based IR System: A Com...tmra
 
Dynamic Search Using Semantics & Statistics
Dynamic Search Using Semantics & StatisticsDynamic Search Using Semantics & Statistics
Dynamic Search Using Semantics & StatisticsPaul Hofmann
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Andy Petrella
 
A Generic Scientific Data Model and Ontology for Representation of Chemical Data
A Generic Scientific Data Model and Ontology for Representation of Chemical DataA Generic Scientific Data Model and Ontology for Representation of Chemical Data
A Generic Scientific Data Model and Ontology for Representation of Chemical DataStuart Chalk
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.Andy Petrella
 
Linked Data Entity Summarization (PhD defense)
Linked Data Entity Summarization (PhD defense)Linked Data Entity Summarization (PhD defense)
Linked Data Entity Summarization (PhD defense)Andreas Thalhammer
 
Lec1-Into
Lec1-IntoLec1-Into
Lec1-Intobutest
 
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...Marko Rodriguez
 
Data Science, Data & Dashboards Design
Data Science, Data & Dashboards DesignData Science, Data & Dashboards Design
Data Science, Data & Dashboards DesignKoo Ping Shung
 
Collnet turkey feroz-core_scientific domain
Collnet turkey feroz-core_scientific domainCollnet turkey feroz-core_scientific domain
Collnet turkey feroz-core_scientific domainHan Woo PARK
 
Collnet _Conference_Turkey
Collnet _Conference_TurkeyCollnet _Conference_Turkey
Collnet _Conference_TurkeyGohar Feroz Khan
 
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceSQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceUniversity of Washington
 
Aspects of broad folksonomies
Aspects of broad folksonomiesAspects of broad folksonomies
Aspects of broad folksonomiesdermotte
 

Ähnlich wie ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization (20)

LinkSUM: Using Link Analysis to Summarize Entity Data
LinkSUM: Using Link Analysis to Summarize Entity DataLinkSUM: Using Link Analysis to Summarize Entity Data
LinkSUM: Using Link Analysis to Summarize Entity Data
 
Profiling Linked Open Data
Profiling Linked Open DataProfiling Linked Open Data
Profiling Linked Open Data
 
Managing textual data semantically in relational databases by wael yahfooz an...
Managing textual data semantically in relational databases by wael yahfooz an...Managing textual data semantically in relational databases by wael yahfooz an...
Managing textual data semantically in relational databases by wael yahfooz an...
 
Driving Deep Semantics in Middleware and Networks: What, why and how?
Driving Deep Semantics in Middleware and Networks: What, why and how?Driving Deep Semantics in Middleware and Networks: What, why and how?
Driving Deep Semantics in Middleware and Networks: What, why and how?
 
A Topic map-based ontology IR system versus Clustering-based IR System: A Com...
A Topic map-based ontology IR system versus Clustering-based IR System: A Com...A Topic map-based ontology IR system versus Clustering-based IR System: A Com...
A Topic map-based ontology IR system versus Clustering-based IR System: A Com...
 
Dynamic Search Using Semantics & Statistics
Dynamic Search Using Semantics & StatisticsDynamic Search Using Semantics & Statistics
Dynamic Search Using Semantics & Statistics
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)
 
A Generic Scientific Data Model and Ontology for Representation of Chemical Data
A Generic Scientific Data Model and Ontology for Representation of Chemical DataA Generic Scientific Data Model and Ontology for Representation of Chemical Data
A Generic Scientific Data Model and Ontology for Representation of Chemical Data
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
 
Data models
Data modelsData models
Data models
 
Linked Data Entity Summarization (PhD defense)
Linked Data Entity Summarization (PhD defense)Linked Data Entity Summarization (PhD defense)
Linked Data Entity Summarization (PhD defense)
 
Lec1-Into
Lec1-IntoLec1-Into
Lec1-Into
 
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...
 
Eurolan 2005 Pedersen
Eurolan 2005 PedersenEurolan 2005 Pedersen
Eurolan 2005 Pedersen
 
Data Science, Data & Dashboards Design
Data Science, Data & Dashboards DesignData Science, Data & Dashboards Design
Data Science, Data & Dashboards Design
 
Collnet turkey feroz-core_scientific domain
Collnet turkey feroz-core_scientific domainCollnet turkey feroz-core_scientific domain
Collnet turkey feroz-core_scientific domain
 
Collnet _Conference_Turkey
Collnet _Conference_TurkeyCollnet _Conference_Turkey
Collnet _Conference_Turkey
 
Collnet_Conference_Turkey
Collnet_Conference_TurkeyCollnet_Conference_Turkey
Collnet_Conference_Turkey
 
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceSQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
 
Aspects of broad folksonomies
Aspects of broad folksonomiesAspects of broad folksonomies
Aspects of broad folksonomies
 

Kürzlich hochgeladen

SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predieusebiomeyer
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书rnrncn29
 
IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119APNIC
 
ETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxNIMMANAGANTI RAMAKRISHNA
 
Unidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxUnidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxmibuzondetrabajo
 
Company Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxCompany Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxMario
 
Cybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best PracticesCybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best PracticesLumiverse Solutions Pvt Ltd
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书rnrncn29
 
TRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxTRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxAndrieCagasanAkio
 

Kürzlich hochgeladen (9)

SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predi
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
 
IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119
 
ETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptx
 
Unidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxUnidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptx
 
Company Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxCompany Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptx
 
Cybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best PracticesCybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best Practices
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
 
TRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxTRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptx
 

ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization

  • 1. ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization Blerina Spahiu, Riccardo Porrini, Matteo Palmonari, Anisa Rula, Andrea Maurino University of Milano-Bicocca (surname@disco.unimib.it) blerina.spahiu@disco.unimib.it
  • 2. Outline: Motivation; Dataset Understanding; State of the Art; Summarization Framework; Abstract Knowledge Patterns (AKPs); Pattern Minimalization; Summary extraction, storage and presentation; Evaluation; Compactness; Informativeness; User Study; Conclusion and Future Work. (University of Milan - Bicocca)
  • 3. Introduction What types of resources are there in a data set? How are they described? What types of resources are linked by a certain property and how frequently?
  • 4. Motivation: Understanding the content of data sets is challenging. Looking at the ontology is not enough: ontologies may be large and underspecified (DBpedia 2015-04 has 2795 properties, with the domain not specified for 259 properties and the range not specified for 187 properties), and they carry no information about usage. Explorative queries are too expensive: they cause significant server overload and high response times or timeouts.
  • 5. State of the Art. Relevance-based summarization (Troullinou et al. 2015; Zhang et al. 2007): identifies subsets of data sets or ontologies that are considered to be more relevant. Pattern-based approaches (Mihindukulasooriya et al. 2015; Presutti et al. 2011; M. Jarrar and M. Dikaiakos, 2012): aim at extracting knowledge patterns for a complete representation of the data set. Schema induction (Völker and Niepert, 2011): induces a schema from the data and aims at extracting stronger assertions. Statistics about the data set (Konrath et al. 2012; Langegger and Wöß, 2009; Auer et al. 2012; Linked Open Vocabularies, http://lov.okfn.org/): aim at reporting statistics about the usage of different vocabularies, properties and types in the data.
  • 6. State of the Art (continued): the same four families of approaches as on slide 5 (relevance-based summarization, pattern-based approaches, schema induction, statistics about the data set), with ABSTAT added to the overview.
  • 7. ABSTAT: ABSTAT (http://abstat.disco.unimib.it) is an ontology-driven linked data summarization framework. A summary provides a complete but compact schema-level representation of a data set and consists of a set of Abstract Knowledge Patterns (AKPs) plus statistics. An AKP represents, for example, the fact that there are instances of type Person linked with instances of type Settlement by the property birthplace. The statistics report how many times each pattern occurs in the data set, how many times a certain type occurs as a minimal type, and how many times a property occurs in the data set.
  • 8. Abstract Knowledge Patterns (AKPs): ABSTAT adopts a minimalization mechanism based on minimal type patterns. Minimalization is based on a subtype graph, which represents the data ontology. Abstract Knowledge Patterns (AKPs) are abstract representations of Knowledge Patterns. An AKP is a triple (C, P, D) such that C and D are types and P is a property. ABSTAT represents only the AKPs that occur in the data set and are built from minimal types.
  • 9. An example of how AKPs are extracted. [Diagram of a small graph with types, instances and literals. Types: Person, Sportist, FootballPlayer, Artist, Lawyer, XMLSchema#Date, connected by subclassOf edges. Instances: Jim Brown, George Clooney, Amal Clooney, connected to their types by type edges. Literal: "1936-02-17", typed as XMLSchema#Date. Properties: hasWife, birthDate.] The (minimal-type) patterns extracted by ABSTAT are: <Artist, hasWife, Lawyer> and <FootballPlayer, birthDate, XMLSchema#Date>.
  • 10. An example of how AKPs are extracted (continued). [Same diagram as slide 9.] The (minimal-type) patterns extracted by ABSTAT are: <Artist, hasWife, Lawyer> and <FootballPlayer, birthDate, XMLSchema#Date>. Redundant patterns excluded from the summary: <Person, hasWife, Person>, <Sportist, birthDate, XMLSchema#Date> and <Person, birthDate, XMLSchema#Date>.
  • 11. An example of how AKPs are extracted (continued). [Same diagram as slide 9, extended with one more typed assertion.] The (minimal-type) patterns extracted by ABSTAT are: <Artist, hasWife, Lawyer>, <FootballPlayer, birthDate, XMLSchema#Date> and <Artist, birthDate, XMLSchema#Date>.
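
The extraction illustrated in slides 9 to 11 can be sketched in a few lines of Python. This is only an illustrative sketch, not ABSTAT's implementation: the subtype graph and the two assertions are read off the slide 9 diagram (the exact edges are an assumption reconstructed from the listed patterns), and all function and variable names are hypothetical. It shows that keeping only minimal types yields the two patterns on the slide, while the patterns listed as redundant fall out as the difference from the non-minimalized set.

```python
# Subtype graph assumed from the slide 9 diagram (child -> parent).
SUBCLASS_OF = {
    "FootballPlayer": "Sportist",
    "Sportist": "Person",
    "Artist": "Person",
    "Lawyer": "Person",
}

# Asserted types of instances and of the literal (assumed from the diagram).
ASSERTED_TYPES = {
    "Jim Brown": {"FootballPlayer"},
    "George Clooney": {"Artist"},
    "Amal Clooney": {"Lawyer"},
    '"1936-02-17"': {"XMLSchema#Date"},
}

# Relational assertions (subject, property, object) assumed from the diagram.
TRIPLES = [
    ("George Clooney", "hasWife", "Amal Clooney"),
    ("Jim Brown", "birthDate", '"1936-02-17"'),
]


def ancestors(t):
    """Return all strict supertypes of t in the subtype graph."""
    found = set()
    while t in SUBCLASS_OF:
        t = SUBCLASS_OF[t]
        found.add(t)
    return found


def all_types(node):
    """Asserted types of a node plus every supertype of those types."""
    types = set(ASSERTED_TYPES[node])
    for t in ASSERTED_TYPES[node]:
        types |= ancestors(t)
    return types


def minimal_types(node):
    """Types of a node that are not a supertype of another type of the node."""
    types = all_types(node)
    return {t for t in types if not any(t in ancestors(u) for u in types)}


def patterns(triples, type_fn):
    """Build (C, P, D) patterns from triples using type_fn to type each end."""
    return {
        (c, p, d)
        for s, p, o in triples
        for c in type_fn(s)
        for d in type_fn(o)
    }


minimal = patterns(TRIPLES, minimal_types)   # patterns kept in the summary
naive = patterns(TRIPLES, all_types)         # patterns without minimalization
redundant = naive - minimal                  # patterns dropped by minimalization

for akp in sorted(minimal):
    print(akp)
```

Without minimalization, the same two assertions produce seven patterns instead of two, which is the effect the compactness evaluation measures.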
  • 13. ABSTAT User Interfaces: ABSTAT homepage (http://abstat.disco.unimib.it); ABSTATBrowse (http://abstat.disco.unimib.it/browse); ABSTATSearch (http://abstat.disco.unimib.it/search); SPARQL Endpoint (http://abstat.disco.unimib.it/sparql).
  • 14. Experimental Evaluation. Summary compactness: number of patterns in the summary vs. number of triples in the data set; comparison with a similar approach without minimalization. Summary informativeness: insights about the semantics of the properties; small-scale user study.
  • 15. Compactness
    Data sets and summaries statistics:
    | Dataset | Relational assertions | Typing assertions | Assertions (total) | Types (Ext.) | Properties (Ext.) | Patterns |
    |---|---|---|---|---|---|---|
    | DBpedia Core 2014 | 40.5M | 29.7M | 70.1M | 869 (85) | 1439 (15) | 171340 |
    | DBpedia 3.9 Infobox | 96.3M | 19.7M | 116.4M | 821 (58) | 62572 (14) | 732418 |
    | Linked Brainz | 180.1M | 39.6M | 221.7M | 21 (9) | 33 (0) | 161 |
    Reduction rate = (number of patterns) / (number of assertions in the data set):
    | Dataset | ABSTAT | LOUPE (similar to ABSTAT without minimalization) |
    |---|---|---|
    | DBpedia Core 2014 | 0.002 | 0.01 |
    | Linked Brainz | 6.72 × 10^-7 | 7.1 × 10^-7 |
    Minimalization produces more compact summaries. The advantage of minimalization is more observable for data sets with richer subtype graphs and typing assertions.
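
As a rough check of the reduction rate definition on slide 15, the ratio can be recomputed from the table. This is a minimal sketch: the assertion counts are the rounded figures from the slide (in millions), so the results only approximate the reported rates, and the function name is hypothetical.

```python
def reduction_rate(num_patterns, num_assertions):
    """Number of patterns in the summary divided by assertions in the data set."""
    return num_patterns / num_assertions


# (patterns, total assertions) as reported on slide 15; assertions are rounded.
DATASETS = {
    "DBpedia Core 2014": (171340, 70.1e6),
    "DBpedia 3.9 Infobox": (732418, 116.4e6),
    "Linked Brainz": (161, 221.7e6),
}

rates = {name: reduction_rate(p, a) for name, (p, a) in DATASETS.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2e}")
```

A lower rate means a more compact summary relative to the size of the data set.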
  • 16. Informativeness: ABSTAT summaries provide useful insights about the semantics of properties, based on their usage within a data set.
    | Dataset | Missing Domain (%) | Missing Range (%) | Missing Domain & Range (%) |
    |---|---|---|---|
    | DBpedia Core 2014 | 259 (18%) | 187 (13%) | 48 (3.3%) |
    | DBpedia 3.9 Infobox | 61368 (98%) | 61309 (98%) | 61161 (97%) |
    | Linked Brainz | 13 (39%) | 15 (45%) | 13 (39%) |
  • 17. Inferred domain and range for DBpedia Core 2014. [Bar chart. Vertical axis: number of minimal types (0 to 160). Horizontal axis: DBpedia properties such as dbo:type, dbo:religion, dbo:division, dbo:picture, dbo:isPartOf, dbo:position, dbo:series. Series: extracted minimal types (domain) and extracted minimal types (range).]
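
The idea behind slides 16 and 17, reading the actual domain and range of a property off the summary, can be sketched as follows. This is an assumption-laden illustration rather than ABSTAT's code: it takes AKPs as plain (C, P, D) tuples, uses the three patterns from slide 11 as input, and the function name is hypothetical.

```python
from collections import defaultdict


def observed_domain_range(akps):
    """Group the minimal types seen as subject (domain) and object (range) per property."""
    domain = defaultdict(set)
    rng = defaultdict(set)
    for c, p, d in akps:
        domain[p].add(c)
        rng[p].add(d)
    return dict(domain), dict(rng)


# Minimal-type patterns from slide 11.
AKPS = [
    ("Artist", "hasWife", "Lawyer"),
    ("FootballPlayer", "birthDate", "XMLSchema#Date"),
    ("Artist", "birthDate", "XMLSchema#Date"),
]

domain, rng = observed_domain_range(AKPS)

for prop in sorted(domain):
    print(prop, "domain:", sorted(domain[prop]), "range:", sorted(rng[prop]))
```

Counting the size of each set per property gives the kind of figure plotted on slide 17.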
  • 18. User Study: Setup. Can ABSTAT be useful to support query formulation? Queries to DBpedia 3.9 Infobox are taken from the Question Answering over Linked Data (QALD) benchmark: 5 queries of increasing length (1 of length 1, 2 of length 2 and 2 of length 3). 20 participants in 2 groups: the abstat group uses ABSTAT (after 20 min of training), the control group does not use ABSTAT. Measures: time needed to formulate the query and accuracy of the answer.
  • 19. User Study: Questionnaire
  • 20. User Study: Results
    | Query | Group | Avg. completion time (s) | Accuracy |
    |---|---|---|---|
    | Query 1 (length 1): How many employees does Google have? | abstat | 358.9 | 0.9 |
    | | control | 380.6 | 0.8 |
    | Query 2 (length 2): Give me all people that were born in Vienna and died in Berlin. | abstat | 356.3 | 1 |
    | | control | 346.9 | 0.8 |
    | Query 3 (length 2): Which professional surfers were born in Australia? | abstat | 476.6 | 0.6 |
    | | control | 234.24 | 0.7 |
    | Query 4 (length 3): In which films directed by Garry Marshall was Julia Roberts starring? | abstat | 333.4 | 0.9 |
    | | control | 445.6 | 0.9 |
    | Query 5 (length 3): Give me all books by William Goldman with more than 300 pages. | abstat | 233.4 | 1 |
    | | control | 569.8 | 0.7 |
    An independent t-test showed a significant difference between the two groups in answering Q5 correctly: t(16) = 10.32, p < .005.
  • 21. User Study: Results Analysis. Users in the abstat group benefit from the ABSTAT summary in terms of average completion time, accuracy, or both. Their accuracy increases as the queries become more difficult, and they perform the tasks faster. The exception is query 3, because the individual Surfing is classified with no type other than owl:Thing. Participants in the control group used two strategies to answer the queries: most directly accessed the public web pages describing the DBpedia named individuals mentioned in the query, and very few submitted explorative SPARQL queries to the endpoint.
  • 22. Conclusion and Future Work. ABSTAT: ontology-driven summarization with minimalization. Considerable reduction rate and promising results about the informativeness of the summary. The user study is currently being extended. Future work: apply relevance-oriented summarization methods based on connectivity analysis; take the inheritance of properties into account to produce even more compact summaries; carry out a complete analysis of the most important data sets available in the LOD cloud (20+ data sets available). APIs available soon.
  • 23. Thank you for your attention!