{RDF} Data Quality Assessment
Connecting the pieces...
Dimitris Kontokostas
Senior Knowledge Engineer
Connected Data London 2018 - Nov 7th 2018
About me...
● Data geek, software engineer & open source enthusiast
● PhD in knowledge extraction and quality assessment
● Involved in graph-related standardization activities (ShEx/SHACL)
● Author of the RDFUnit Java library
● Co-author of “Validating RDF Data” book
● Working on the GeoPhy Real Estate Knowledge Graph
Overview
● Attempt to define data quality
● Identify data quality issues
● Means for tackling them
What is data quality?
What is ??? quality?
Quality of life...
Quality of OS
Data Quality is multidimensional
Which one is better?
ex:Foo
    a dbo:Person ;
    dbo:birthDate "2000-01-01"^^xsd:date .
ex:Bar
    a foaf:Person ;
    foaf:age 18 .
ex:Baz
    wkd:p31 wk:Q5 ;
    wkd:p569 "2000-01-01"^^xsd:date .
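One possible way to compare the first two records: a stored foaf:age is only correct around the time it was written, while a birth date lets the consumer derive the age for any day. A minimal Python sketch, assuming a hypothetical helper name `age_on` and illustrative reference dates (neither comes from the slides):

```python
from datetime import date

def age_on(birth_date: date, today: date) -> int:
    """Return the age in full years on `today` for someone born on `birth_date`."""
    years = today.year - birth_date.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

birth = date(2000, 1, 1)
print(age_on(birth, date(2018, 11, 7)))  # 18, matches the stored foaf:age
print(age_on(birth, date(2020, 6, 1)))   # 20, the stored value would now be stale
```

Which record is "better" still depends on the use case; the sketch only shows one trade-off.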
Would you use this information for …
ex:Chickenpox
    a ex:InfectiousDisease ;
    ex:symptoms "rash", "fever", "headache" ;
    ex:treatWithVaccine ex:VaricellaVaccine .
ex:VaricellaVaccine
    a ex:Vaccine ;
    ex:treats ex:Chickenpox, ex:HerpesZoster .
- a visualization?
- a disease website?
- automated treatment?
Data Quality is fitness for use
Data Quality Dimension themes
Accessibility: accessing & retrieving the data, in whole or in part
Contextual: depends on the use-case context or consumer preference
Intrinsic: independent of context
Representational: related to data design
See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
Accessibility Dimensions
Availability: can you access the data?
Licence: can you use the data?
Performance: can you get the data in a reasonable time?
See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
Contextual Dimensions
Relevancy: does it cover your needs?
Trustworthiness: do you trust the publisher?
Understandability: do you understand the data? Is there documentation?
Timeliness: is the data stale or up to date?
See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
Intrinsic Dimensions
Syntactically valid: are there any syntax errors?
Semantically accurate: are there outliers or misused labels?
Consistent: are there inconsistencies?
Concise: are there duplicates, ambiguity or NULLs?
Complete: are records or values missing?
See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
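Some of these intrinsic checks can be automated with very little code. The Python sketch below is an illustration only: the records, the `assess` function and the report keys are made up for this example, and plain dictionaries stand in for RDF resources. It flags syntax errors in date values, missing values and exact duplicates.

```python
from collections import Counter
from datetime import date

# Hypothetical toy records; the property name mirrors the slide examples, the values are made up.
records = [
    {"id": "ex:Foo", "type": "dbo:Person", "birthDate": "2000-01-01"},
    {"id": "ex:Bar", "type": "dbo:Person", "birthDate": "2000-13-01"},  # invalid month
    {"id": "ex:Baz", "type": "dbo:Person"},                             # birthDate missing
    {"id": "ex:Foo", "type": "dbo:Person", "birthDate": "2000-01-01"},  # exact duplicate
]

def is_iso_date(value: str) -> bool:
    """True if the value parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

def assess(records):
    """Return the ids that fail three simple intrinsic checks."""
    invalid = [r["id"] for r in records
               if "birthDate" in r and not is_iso_date(r["birthDate"])]
    missing = [r["id"] for r in records if "birthDate" not in r]
    counts = Counter(tuple(sorted(r.items())) for r in records)
    duplicated = sorted({dict(key)["id"] for key, n in counts.items() if n > 1})
    return {"syntax_errors": invalid, "incomplete": missing, "duplicated": duplicated}

report = assess(records)
print(report)
```

Semantic accuracy and consistency usually need domain knowledge or a schema, so they are left out of this sketch.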
Representational Dimensions
Interoperable: are terms/labels/vocabularies reused?
Interpretable: is it self-descriptive?
Versatile: is it provided in multiple formats / languages?
See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
How good do you need it to be?
There are significant costs in:
> assessing the quality of a dataset
> improving the quality of a dataset
Cost depends heavily on whether:
> data & assets are outside of your control
> data & assets are within your control
> data & assets are bought
More or less quality than you need impacts cost and/or the product
[Figure: quality vs. cost ($)]
Where things can go wrong
Where data can go wrong
Source data
Master schema
Mappings
Validation Rules
Identity Resolution
Data Fusion
Source data
can be (semi) unstructured
can be messy
may not fit into a/the schema
Master schema
Incorrect modeling
Incomplete modeling
Inaccurate translation
> to OWL, RDFS, ShEx, SHACL, etc.
Undesired expressivity
> RDFS, OWL DL/RL/Full, etc.
Mappings
Incorrect mapping
> errors scale with the source size (up to millions)
Incomplete mapping
Software bugs
> conversion scripts, ETL code, etc.
Model sync
> port schema updates
Validation Rules
Incorrect translation
> intended: birthDate max cardinality 1
> written: birthDate min cardinality 1
Syntax error & typos
> dirthDate must be xsd:date
Model sync
> port schema updates
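Validation rules can themselves be linted against the schema. The Python sketch below assumes a hypothetical rule format (plain dictionaries) and a hypothetical set of schema properties; it flags rules that target a property the schema does not define, such as the dirthDate typo, and suggests the closest known name using the standard library's difflib.

```python
import difflib

# Hypothetical schema vocabulary and rules, loosely following the slide's examples.
SCHEMA_PROPERTIES = {"birthDate", "deathDate", "name"}

rules = [
    {"property": "birthDate", "max_count": 1},
    {"property": "dirthDate", "datatype": "xsd:date"},  # typo from the slide
]

def lint_rules(rules, known_properties):
    """Return (property, suggestion) pairs for rules that target unknown properties."""
    problems = []
    for rule in rules:
        prop = rule["property"]
        if prop not in known_properties:
            close = difflib.get_close_matches(prop, sorted(known_properties), n=1)
            problems.append((prop, close[0] if close else None))
    return problems

problems = lint_rules(rules, SCHEMA_PROPERTIES)
print(problems)
```

A check like this does not catch a min/max mix-up, which is a logical error rather than a naming error; that kind of mistake needs test data with known expected results.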
Evolution & quality
See http://aligned-project.eu
Sounds good so far… now what?
Strategies for managing quality
Data testers
> explicit / implicit roles
Crowdsourcing
> field experts vs MTurk
Executable validation rules
> SHACL, ShEx, OWL
See Acosta et al., Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study
Kontokostas et al., TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data (demo)
> Needs good tool support
> Generic tools missing
> Validation engines improved
Validate closer to the source of the error
See Dimou et al., Assessing and Refining Mappings to RDF to Improve Dataset Quality
Kontokostas et al., Semantically Enhanced Quality Assurance in the JURION Business Use Case
> Always in the K range
> Scales with source size
> Errors scale as well
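The scaling argument can be illustrated with a toy mapping. In the Python sketch below, the mapping, the property names and the row count are all hypothetical: one typo in a mapping definition produces one violation per source row when the generated data is validated, but only a single violation when the mapping itself is validated.

```python
# Hypothetical mapping from source columns to RDF properties;
# "dbo:birthDat" is a deliberate typo.
mapping = {"name": "foaf:name", "born": "dbo:birthDat"}
KNOWN_PROPERTIES = {"foaf:name", "dbo:birthDate"}

def apply_mapping(rows, mapping):
    """Generate (subject, property, value) triples for every mapped column of every row."""
    return [
        (f"ex:person{i}", mapping[column], value)
        for i, row in enumerate(rows)
        for column, value in row.items()
        if column in mapping
    ]

def unknown_properties(properties, known):
    """Return every property occurrence that is not part of the known vocabulary."""
    return [p for p in properties if p not in known]

rows = [{"name": f"Person {i}", "born": "2000-01-01"} for i in range(1000)]
triples = apply_mapping(rows, mapping)

errors_in_data = unknown_properties([p for _, p, _ in triples], KNOWN_PROPERTIES)
errors_in_mapping = unknown_properties(mapping.values(), KNOWN_PROPERTIES)

print(len(errors_in_data))     # 1000: one violation per source row
print(len(errors_in_mapping))  # 1: the single root cause
```

Checking the mapping keeps the number of reported errors proportional to the size of the mapping rather than the size of the source.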
Automate, automate & automate...
Schema.ttl
ex:name
    a rdf:Property ;
    rdfs:range rdf:langString .
Data.ttl
ex:Foo
    a dbo:Person ;
    ex:name "Foo @en" .    # wrong: the language tag is inside the string, so this is a plain string
ex:Foo
    ex:name "Foo"@en .     # fixed: a language-tagged literal (rdf:langString)
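A check like this is easy to automate. The Python sketch below does not use an RDF library; the `Literal` class is a simplified, hypothetical stand-in for an RDF literal, and the set of properties with range rdf:langString is hard-coded rather than read from Schema.ttl. It reports every value of such a property that carries no language tag.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Literal:
    """Simplified stand-in for an RDF literal (not a real library class)."""
    value: str
    lang: Optional[str] = None

# Assumption: properties whose rdfs:range is rdf:langString, as in Schema.ttl.
LANG_STRING_PROPERTIES = {"ex:name"}

def check_lang_strings(triples):
    """Return the triples whose object should carry a language tag but does not."""
    return [
        (s, p, o) for s, p, o in triples
        if p in LANG_STRING_PROPERTIES and o.lang is None
    ]

data = [
    ("ex:Foo", "ex:name", Literal("Foo @en")),         # tag typed inside the string
    ("ex:Foo", "ex:name", Literal("Foo", lang="en")),  # proper language-tagged literal
]

violations = check_lang_strings(data)
print(violations)
```

In practice a SHACL or ShEx engine would derive this check from the schema; the sketch only shows the shape of the test.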
CI/CD is your best friend
Treat data as code
> Jenkins, Travis, GitLab, TeamCity, ...
Trigger validation on every (single) change
> Fail the build until data issues are fixed
Create (data) integration tests
Just like in software…
> A green CI does not mean there are no errors/bugs
> A green CI may just mean there are not enough tests
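Failing the build comes down to returning a non-zero exit code when any data check reports a problem. The Python sketch below is a hedged illustration: the check names and their results are made up, and the lambdas are placeholders for calls to a real validation engine.

```python
def run_checks(checks):
    """Run each named check and collect a message for every check that reports problems."""
    failures = []
    for name, check in checks:
        problems = check()
        if problems:
            failures.append(f"{name}: {len(problems)} problem(s)")
    return failures

def exit_code(failures):
    """CI servers treat a non-zero exit code as a failed build."""
    return 1 if failures else 0

# Hypothetical checks; a real pipeline would call a validation engine here.
checks = [
    ("birthDate present", lambda: []),
    ("name has language tag", lambda: ["ex:Foo"]),
]

failures = run_checks(checks)
print(failures)
print(exit_code(failures))
# In a CI job, the script would end with: sys.exit(exit_code(failures))
```

As with software tests, an exit code of 0 only says that the checks that exist passed.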
Recap
> Data quality is fitness for use
> Can be assessed along multiple dimensions
> Identify the quality you need
> Also look for errors in the schema, the rules and the mappings
> Validate closer to the error source
> Automate as much as possible
Thank you! Questions?
@jimkont
kontokostas.com
slideshare.net/jimkont

Weitere ähnliche Inhalte

Was ist angesagt?

What Factors Influence the Design of a Linked Data Generation Algorithm?
What Factors Influence the Design of a Linked Data Generation Algorithm?What Factors Influence the Design of a Linked Data Generation Algorithm?
What Factors Influence the Design of a Linked Data Generation Algorithm?andimou
 
Fried data summit data quality data analytics together
Fried data summit data quality data analytics togetherFried data summit data quality data analytics together
Fried data summit data quality data analytics togetherJeff Fried
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI
 
Scaling up business value with real-time operational graph analytics
Scaling up business value with real-time operational graph analyticsScaling up business value with real-time operational graph analytics
Scaling up business value with real-time operational graph analyticsConnected Data World
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital.AI
 
Big Data and the Semantic Web: Challenges and Opportunities
Big Data and the Semantic Web: Challenges and OpportunitiesBig Data and the Semantic Web: Challenges and Opportunities
Big Data and the Semantic Web: Challenges and OpportunitiesSrinath Srinivasa
 
Adding Rules on Existing Hypermedia APIs
Adding Rules on Existing Hypermedia APIsAdding Rules on Existing Hypermedia APIs
Adding Rules on Existing Hypermedia APIsMichael Petychakis
 
Quality aware subgraph matching over inconsistent probabilistic graph databases
Quality aware subgraph matching over inconsistent probabilistic graph databasesQuality aware subgraph matching over inconsistent probabilistic graph databases
Quality aware subgraph matching over inconsistent probabilistic graph databasesieeechennai
 
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...semanticsconference
 
Social media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge GraphSocial media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge GraphGraphAware
 
Graph-Powered Machine Learning
Graph-Powered Machine LearningGraph-Powered Machine Learning
Graph-Powered Machine LearningDatabricks
 
Ethics & (Explainable) AI – Semantic AI & the Role of the Knowledge Scientist
Ethics & (Explainable) AI – Semantic AI & the Role of the Knowledge ScientistEthics & (Explainable) AI – Semantic AI & the Role of the Knowledge Scientist
Ethics & (Explainable) AI – Semantic AI & the Role of the Knowledge ScientistStratos Kontopoulos
 
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...semanticsconference
 
The year of the graph: do you really need a graph database? How do you choose...
The year of the graph: do you really need a graph database? How do you choose...The year of the graph: do you really need a graph database? How do you choose...
The year of the graph: do you really need a graph database? How do you choose...George Anadiotis
 
RDF by Structured Reference to Semantics, the RS2 framework
RDF by Structured Reference to Semantics, the RS2 frameworkRDF by Structured Reference to Semantics, the RS2 framework
RDF by Structured Reference to Semantics, the RS2 frameworkKhan Mostafa
 
Integrating Relational Databases with the Semantic Web: A Reflection
Integrating Relational Databases with the Semantic Web: A ReflectionIntegrating Relational Databases with the Semantic Web: A Reflection
Integrating Relational Databases with the Semantic Web: A ReflectionJuan Sequeda
 
Iterative data discovery and transformation with open refine
Iterative data discovery and transformation with open refineIterative data discovery and transformation with open refine
Iterative data discovery and transformation with open refineMartin Magdinier
 

Was ist angesagt? (20)

What Factors Influence the Design of a Linked Data Generation Algorithm?
What Factors Influence the Design of a Linked Data Generation Algorithm?What Factors Influence the Design of a Linked Data Generation Algorithm?
What Factors Influence the Design of a Linked Data Generation Algorithm?
 
Fried data summit data quality data analytics together
Fried data summit data quality data analytics togetherFried data summit data quality data analytics together
Fried data summit data quality data analytics together
 
Semantic web an overview and projects
Semantic web   an  overview and projectsSemantic web   an  overview and projects
Semantic web an overview and projects
 
Sebastian Hellmann
Sebastian HellmannSebastian Hellmann
Sebastian Hellmann
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent Apps
 
Scaling up business value with real-time operational graph analytics
Scaling up business value with real-time operational graph analyticsScaling up business value with real-time operational graph analytics
Scaling up business value with real-time operational graph analytics
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
 
Big Data and the Semantic Web: Challenges and Opportunities
Big Data and the Semantic Web: Challenges and OpportunitiesBig Data and the Semantic Web: Challenges and Opportunities
Big Data and the Semantic Web: Challenges and Opportunities
 
Adding Rules on Existing Hypermedia APIs
Adding Rules on Existing Hypermedia APIsAdding Rules on Existing Hypermedia APIs
Adding Rules on Existing Hypermedia APIs
 
Quality aware subgraph matching over inconsistent probabilistic graph databases
Quality aware subgraph matching over inconsistent probabilistic graph databasesQuality aware subgraph matching over inconsistent probabilistic graph databases
Quality aware subgraph matching over inconsistent probabilistic graph databases
 
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
 
Social media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge GraphSocial media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge Graph
 
Graph-Powered Machine Learning
Graph-Powered Machine LearningGraph-Powered Machine Learning
Graph-Powered Machine Learning
 
Ethics & (Explainable) AI – Semantic AI & the Role of the Knowledge Scientist
Ethics & (Explainable) AI – Semantic AI & the Role of the Knowledge ScientistEthics & (Explainable) AI – Semantic AI & the Role of the Knowledge Scientist
Ethics & (Explainable) AI – Semantic AI & the Role of the Knowledge Scientist
 
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
 
Ethical solutions services
Ethical solutions servicesEthical solutions services
Ethical solutions services
 
The year of the graph: do you really need a graph database? How do you choose...
The year of the graph: do you really need a graph database? How do you choose...The year of the graph: do you really need a graph database? How do you choose...
The year of the graph: do you really need a graph database? How do you choose...
 
RDF by Structured Reference to Semantics, the RS2 framework
RDF by Structured Reference to Semantics, the RS2 frameworkRDF by Structured Reference to Semantics, the RS2 framework
RDF by Structured Reference to Semantics, the RS2 framework
 
Integrating Relational Databases with the Semantic Web: A Reflection
Integrating Relational Databases with the Semantic Web: A ReflectionIntegrating Relational Databases with the Semantic Web: A Reflection
Integrating Relational Databases with the Semantic Web: A Reflection
 
Iterative data discovery and transformation with open refine
Iterative data discovery and transformation with open refineIterative data discovery and transformation with open refine
Iterative data discovery and transformation with open refine
 

Ähnlich wie RDF Data Quality Assessment - connecting the pieces

Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentAmrapali Zaveri, PhD
 
Building a Data Quality Program from Scratch
Building a Data Quality Program from ScratchBuilding a Data Quality Program from Scratch
Building a Data Quality Program from Scratchdmurph4
 
Data Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practicesData Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practicesCarl Anderson
 
Creating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyCreating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyRTTS
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Lucidworks
 
Spark Summit Keynote by Suren Nathan
Spark Summit Keynote by Suren NathanSpark Summit Keynote by Suren Nathan
Spark Summit Keynote by Suren NathanSpark Summit
 
Data Quality
Data QualityData Quality
Data QualityVijaya K
 
Data Quality
Data QualityData Quality
Data Qualityjerdeb
 
The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...Pieter De Leenheer
 
The Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphThe Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphTrey Grainger
 
Fried data summit big data for lob content
Fried data summit big data for lob contentFried data summit big data for lob content
Fried data summit big data for lob contentJeff Fried
 
Data quality testing – a quick checklist to measure and improve data quality
Data quality testing – a quick checklist to measure and improve data qualityData quality testing – a quick checklist to measure and improve data quality
Data quality testing – a quick checklist to measure and improve data qualityJaveriaGauhar
 
STC Information Topology
STC Information TopologySTC Information Topology
STC Information TopologyTyrinAvery1
 
TechChat_Making_Assessment_Happen2_Turnitin.pdf
TechChat_Making_Assessment_Happen2_Turnitin.pdfTechChat_Making_Assessment_Happen2_Turnitin.pdf
TechChat_Making_Assessment_Happen2_Turnitin.pdfdebbieholley1
 
Human vs AI Quality Raters for Search Engines.pdf
Human vs AI Quality Raters for Search Engines.pdfHuman vs AI Quality Raters for Search Engines.pdf
Human vs AI Quality Raters for Search Engines.pdfDawn Anderson MSc DigM
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Trey Grainger
 
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Lucidworks
 

Ähnlich wie RDF Data Quality Assessment - connecting the pieces (20)

Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality Assessment
 
Building a Data Quality Program from Scratch
Building a Data Quality Program from ScratchBuilding a Data Quality Program from Scratch
Building a Data Quality Program from Scratch
 
Data Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practicesData Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practices
 
Creating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyCreating a Data validation and Testing Strategy
Creating a Data validation and Testing Strategy
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
 
Spark Summit Keynote by Suren Nathan
Spark Summit Keynote by Suren NathanSpark Summit Keynote by Suren Nathan
Spark Summit Keynote by Suren Nathan
 
Data Quality
Data QualityData Quality
Data Quality
 
Data Quality
Data QualityData Quality
Data Quality
 
The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...
 
The Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphThe Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge Graph
 
Fried data summit big data for lob content
Fried data summit big data for lob contentFried data summit big data for lob content
Fried data summit big data for lob content
 
Data quality testing – a quick checklist to measure and improve data quality
Data quality testing – a quick checklist to measure and improve data qualityData quality testing – a quick checklist to measure and improve data quality
Data quality testing – a quick checklist to measure and improve data quality
 
Measurement And Validation
Measurement And ValidationMeasurement And Validation
Measurement And Validation
 
KBART update ER&L 2009
KBART update ER&L 2009KBART update ER&L 2009
KBART update ER&L 2009
 
ER&L KBART Update
ER&L KBART UpdateER&L KBART Update
ER&L KBART Update
 
STC Information Topology
STC Information TopologySTC Information Topology
STC Information Topology
 
TechChat_Making_Assessment_Happen2_Turnitin.pdf
TechChat_Making_Assessment_Happen2_Turnitin.pdfTechChat_Making_Assessment_Happen2_Turnitin.pdf
TechChat_Making_Assessment_Happen2_Turnitin.pdf
 
Human vs AI Quality Raters for Search Engines.pdf
Human vs AI Quality Raters for Search Engines.pdfHuman vs AI Quality Raters for Search Engines.pdf
Human vs AI Quality Raters for Search Engines.pdf
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...
 
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
 

Mehr von Connected Data World

Systems that learn and reason | Frank Van Harmelen
Systems that learn and reason | Frank Van HarmelenSystems that learn and reason | Frank Van Harmelen
Systems that learn and reason | Frank Van HarmelenConnected Data World
 
Graph Abstractions Matter by Ora Lassila
Graph Abstractions Matter by Ora LassilaGraph Abstractions Matter by Ora Lassila
Graph Abstractions Matter by Ora LassilaConnected Data World
 
Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...
Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...
Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...Connected Data World
 
How to get started with Graph Machine Learning
How to get started with Graph Machine LearningHow to get started with Graph Machine Learning
How to get started with Graph Machine LearningConnected Data World
 
The years of the graph: The future of the future is here
The years of the graph: The future of the future is hereThe years of the graph: The future of the future is here
The years of the graph: The future of the future is hereConnected Data World
 
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2Connected Data World
 
From Taxonomies and Schemas to Knowledge Graphs: Part 3
From Taxonomies and Schemas to Knowledge Graphs: Part 3From Taxonomies and Schemas to Knowledge Graphs: Part 3
From Taxonomies and Schemas to Knowledge Graphs: Part 3Connected Data World
 
In Search of the Universal Data Model
In Search of the Universal Data ModelIn Search of the Universal Data Model
In Search of the Universal Data ModelConnected Data World
 
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph DatabaseGraph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph DatabaseConnected Data World
 
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...Connected Data World
 
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...Connected Data World
 
Semantic similarity for faster Knowledge Graph delivery at scale
Semantic similarity for faster Knowledge Graph delivery at scaleSemantic similarity for faster Knowledge Graph delivery at scale
Semantic similarity for faster Knowledge Graph delivery at scaleConnected Data World
 
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...Connected Data World
 
Schema, Google & The Future of the Web
Schema, Google & The Future of the WebSchema, Google & The Future of the Web
Schema, Google & The Future of the WebConnected Data World
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsConnected Data World
 
Elegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property GraphsElegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property GraphsConnected Data World
 
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...Connected Data World
 
Graph for Good: Empowering your NGO
Graph for Good: Empowering your NGOGraph for Good: Empowering your NGO
Graph for Good: Empowering your NGOConnected Data World
 

Mehr von Connected Data World (20)

Systems that learn and reason | Frank Van Harmelen
Systems that learn and reason | Frank Van HarmelenSystems that learn and reason | Frank Van Harmelen
Systems that learn and reason | Frank Van Harmelen
 
Graph Abstractions Matter by Ora Lassila
Graph Abstractions Matter by Ora LassilaGraph Abstractions Matter by Ora Lassila
Graph Abstractions Matter by Ora Lassila
 
Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...
Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...
Κnowledge Architecture: Combining Strategy, Data Science and Information Arch...
 
How to get started with Graph Machine Learning
How to get started with Graph Machine LearningHow to get started with Graph Machine Learning
How to get started with Graph Machine Learning
 
Graphs in sustainable finance
Graphs in sustainable financeGraphs in sustainable finance
Graphs in sustainable finance
 
The years of the graph: The future of the future is here
The years of the graph: The future of the future is hereThe years of the graph: The future of the future is here
The years of the graph: The future of the future is here
 
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
 
From Taxonomies and Schemas to Knowledge Graphs: Part 3
From Taxonomies and Schemas to Knowledge Graphs: Part 3From Taxonomies and Schemas to Knowledge Graphs: Part 3
From Taxonomies and Schemas to Knowledge Graphs: Part 3
 
In Search of the Universal Data Model
In Search of the Universal Data ModelIn Search of the Universal Data Model
In Search of the Universal Data Model
 
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph DatabaseGraph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
 
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
 
Graph Realities
Graph RealitiesGraph Realities
Graph Realities
 
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne...
 
Semantic similarity for faster Knowledge Graph delivery at scale
Semantic similarity for faster Knowledge Graph delivery at scaleSemantic similarity for faster Knowledge Graph delivery at scale
Semantic similarity for faster Knowledge Graph delivery at scale
 
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at...
 
Schema, Google & The Future of the Web
Schema, Google & The Future of the WebSchema, Google & The Future of the Web
Schema, Google & The Future of the Web
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needs
 
Elegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property GraphsElegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property Graphs
 
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
 
Graph for Good: Empowering your NGO
Graph for Good: Empowering your NGOGraph for Good: Empowering your NGO
Graph for Good: Empowering your NGO
 



RDF Data Quality Assessment - connecting the pieces

  • 1. {RDF} Data Quality Assessment Connecting the pieces... Dimitris Kontokostas Senior Knowledge Engineer Connected Data London 2018 - Nov 7th 2018
  • 2. About me... ● Data geek, software engineer & open source enthusiast ● PhD in knowledge extraction and quality assessment ● Involved in graph-related standardization activities (ShEx/SHACL) ● Author of the RDFUnit Java library ● Co-author of “Validating RDF Data” book ● Working on the GeoPhy Real Estate Knowledge Graph
  • 3. Overview ● Attempt to define data quality ● Identify data quality issues ● Means for tackling them
  • 9. Which one is better? ex:Foo a dbo:Person ; dbo:birthDate "2000-01-01"^^xsd:date . ex:Bar a foaf:Person ; foaf:age 18 . ex:Baz wkd:p31 wk:Q5 ; wkd:p569 "2000-01-01"^^xsd:date .
  • 10. Would you use this information for … ex:Chickenpox a ex:InfectiousDisease ; ex:symptoms "rash", "fever", "headache" ; ex:treatWithVaccine ex:VaricellaVaccine . ex:VaricellaVaccine a ex:Vaccine ; ex:treats ex:Chickenpox, ex:HerpesZoster . - a visualization? - a disease website? - automated treatment?
  • 11. Data Quality is fitness for use
  • 12. Data Quality Dimension themes. Accessibility: accessing & retrieving data, in whole or in part. Contextual: depends on the use-case context or consumer preference. Intrinsic: independent of context. Representational: related to data design. See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  • 13. Accessibility Dimensions. Availability: can you access the data? Licence: can you use the data? Performance: can you get the data in reasonable time? See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  • 14. Contextual Dimensions. Relevancy: does it cover your needs? Trustworthiness: do you trust the publisher? Understandability: do you understand the data? Is there documentation? Timeliness: is the data stale or up to date? See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  • 15. Intrinsic Dimensions. Syntactically valid: are there any syntax errors? Semantically accurate: are there outliers or misused labels? Consistent: are there inconsistencies? Concise: are there duplicates, ambiguity, or NULLs? Complete: are records or values missing? See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  • 16. Representational Dimensions. Interoperable: are terms/labels/vocabularies reused? Interpretable: is it self-descriptive? Versatile: is it provided in multiple formats / languages? See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  • 17. How good do you need it to get? There are great costs in: > assessing the quality of a dataset > improving the quality of a dataset. Cost is highly dependent on whether: > data & assets are outside of your control > data & assets are within your control > data & assets are bought. More or less quality than what you need impacts costs and/or product. [Chart: Quality vs. Cost ($)]
  • 20. Where data can go wrong: Source data, Master schema, Mappings, Validation Rules, Identity Resolution, Data Fusion
  • 21. Source data: can be (semi-)unstructured; can be messy; may not fit into a/the schema
  • 22. Master schema: Incorrect modeling; Incomplete modeling; Inaccurate translation > to OWL, RDFS, ShEx, SHACL, etc.; Undesired expressivity > RDFS, OWL: DL/RL/FULL, etc.
  • 23. Mappings: Incorrect mapping > errors scale with the source size (up to millions); Incomplete mapping; Software bugs > conversion scripts, ETL code, etc.; Model sync > port schema updates
  • 24. Validation Rules: Incorrect translation > birthDate max cardinality 1 > birthDate min cardinality 1; Syntax errors & typos > dirthDate must be xsd:date; Model sync > port schema updates
  • 27. Strategies for managing quality: Data testers > explicit / implicit roles; Crowdsourcing > field experts vs MTurk; Executable validation rules > SHACL, ShEx, OWL. See Acosta et al., Detecting Linked Data quality issues via crowdsourcing: A DBpedia study; Kontokostas et al., TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data / Demo. > Needs good tool support > Generic tools missing > Validation engines improved
  • 28. Validate closer to the source of the error. See Dimou et al., Assessing and Refining Mappings to RDF to Improve Dataset Quality; Kontokostas et al., Semantically Enhanced Quality Assurance in the JURION Business Use Case. > Always in the K range > Scales with source size > Errors scale as well
  • 29. Automate, automate & automate... Schema.ttl: ex:name a rdf:Property ; rdfs:range rdf:langString . Data.ttl: ex:Foo a dbo:Person ; ex:name "Foo @en" .
  • 30. Automate, automate & automate... Schema.ttl: ex:name a rdf:Property ; rdfs:range rdf:langString . Data.ttl: ex:Foo a dbo:Person ; ex:name "Foo @en" . corrected to: ex:name "Foo"@en .
  • 31. CI/CD is your best friend: Treat data as code > Jenkins, Travis, GitLab, TeamCity, ...; Trigger validation on every (single) change > Fail the build until data issues are fixed; Create (data) integration tests. Just like in software… > Green CI <> No Errors/bugs > Green CI => Not enough tests
  • 32. Recap > Data quality is fitness for use > Can be assessed with multiple dimensions > Identify the quality you need > Also look for errors in the schema, the rules and the mappings > Validate closer to the error source > Automate as much as possible
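
The validation-rule pitfalls on slide 24 can be illustrated with a minimal Python sketch. It is not the RDFUnit or SHACL API. The in-memory triples, the `ex:` names, and the `check_property` helper are all hypothetical and exist only for this example. The sketch checks a "max cardinality 1" rule and an `xsd:date` datatype rule on `ex:birthDate`, then shows that the same rule written against the misspelled `ex:dirthDate` matches nothing and therefore reports no errors.

```python
from collections import defaultdict

XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

# Hypothetical in-memory triples: (subject, predicate, (lexical value, datatype IRI)).
TRIPLES = [
    ("ex:Foo", "ex:birthDate", ("2000-01-01", XSD_DATE)),
    ("ex:Bar", "ex:birthDate", ("2000-01-01", XSD_DATE)),
    ("ex:Bar", "ex:birthDate", ("2001-02-03", XSD_DATE)),   # second value: breaks max cardinality 1
    ("ex:Baz", "ex:birthDate", ("last Tuesday", XSD_STRING)),  # wrong datatype
]


def check_property(triples, prop, datatype, max_count=1):
    """Return a sorted list of (subject, message) violations for one property."""
    values = defaultdict(list)
    for subject, predicate, literal in triples:
        if predicate == prop:
            values[subject].append(literal)

    violations = []
    for subject, literals in values.items():
        if len(literals) > max_count:
            violations.append((subject, f"more than {max_count} value(s) for {prop}"))
        for lexical, dt in literals:
            if dt != datatype:
                violations.append((subject, f"{prop} value {lexical!r} is not typed as {datatype}"))
    return sorted(violations)


# The intended rule flags ex:Bar (two values) and ex:Baz (wrong datatype).
violations = check_property(TRIPLES, "ex:birthDate", XSD_DATE)

# The typo'd rule targets a property that never occurs, so it silently passes.
typo_violations = check_property(TRIPLES, "ex:dirthDate", XSD_DATE)
```

The second call shows why the rules themselves need review: a rule that never fires looks identical to a clean dataset.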
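
Slides 29 to 31 combine a schema-driven check with a failing build. The sketch below shows one way this could look in plain Python. It assumes a hypothetical tuple-based data model instead of a real RDF parser, and the names `SCHEMA_RANGES`, `range_violations`, and `ci_exit_code` are illustrative only. The check flags literals on a property whose declared range is `rdf:langString` but which carry no language tag, as with "Foo @en" on the slide.

```python
import sys

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"

# Hypothetical schema: property -> declared rdfs:range (mirrors Schema.ttl on the slide).
SCHEMA_RANGES = {"ex:name": RDF_LANGSTRING}

# Hypothetical data rows: (subject, property, lexical value, language tag or None).
BROKEN_DATA = [("ex:Foo", "ex:name", "Foo @en", None)]  # tag typed inside the string
FIXED_DATA = [("ex:Foo", "ex:name", "Foo", "en")]       # proper language-tagged literal


def range_violations(data, schema_ranges):
    """List literals whose property expects rdf:langString but that have no language tag."""
    found = []
    for subject, prop, lexical, lang in data:
        if schema_ranges.get(prop) == RDF_LANGSTRING and not lang:
            found.append(f"{subject} {prop} {lexical!r}: expected a language-tagged literal")
    return found


def ci_exit_code(data, schema_ranges):
    """Print each violation to stderr and return 1 if any were found, else 0."""
    problems = range_violations(data, schema_ranges)
    for problem in problems:
        print(problem, file=sys.stderr)
    return 1 if problems else 0
```

A CI job could end its validation script with `sys.exit(ci_exit_code(data, SCHEMA_RANGES))`, so that any violation yields a non-zero exit status and fails the build until the data is fixed.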