RDF and graph databases are steadily gaining adoption and are no longer niche choices. For almost 20 years, a constraint language for RDF was a big missing piece in the technology stack and an inhibiting factor for further adoption.
Even though most RDF-based systems performed data validation and quality assessment, there was no standardized way to define constraints. People used ad hoc solutions, or schemas and languages that were not meant for validation.
Thankfully, since 2017 there have been two additions to the RDF technology stack: SHACL & ShEx. Both provide a high-level RDF constraint language for defining data constraints (a.k.a. shapes), each with different strengths.
This talk outlines different types of RDF data quality issues and existing approaches to quality assessment. The goal is to give an overview of the existing RDF validation landscape and, hopefully, to inspire people to improve their RDF publishing workflows.
RDF Data Quality Assessment - connecting the pieces
1. {RDF} Data Quality Assessment
Connecting the pieces...
Dimitris Kontokostas
Senior Knowledge Engineer
Connected Data London 2018 - Nov 7th 2018
2. About me...
● Data geek, software engineer & open source enthusiast
● PhD in knowledge extraction and quality assessment
● Involved in graph-related standardization activities (ShEx/SHACL)
● Author of the RDFUnit Java library
● Co-author of “Validating RDF Data” book
● Working on the GeoPhy Real Estate Knowledge Graph
3. Overview
● Attempt to define data quality
● Identify data quality issues
● Means for tackling them
9. Which one is better?
ex:Foo
a dbo:Person ;
dbo:birthDate "2000-01-01"^^xsd:date .
ex:Bar
a foaf:Person ;
foaf:age 18 .
ex:Baz
wkd:p31 wk:Q5 ;
wkd:p569 "2000-01-01"^^xsd:date .
10. Would you use this information for …
ex:Chickenpox
a ex:InfectiousDisease ;
ex:symptoms "rash", "fever", "headache" ;
ex:treatWithVaccine ex:VaricellaVaccine .
ex:VaricellaVaccine
a ex:Vaccine ;
ex:treats ex:Chickenpox, ex:HerpesZoster .
- a visualization?
- a disease website?
- automated treatment?
12. Data Quality Dimension themes
Accessibility: accessing & retrieving data, in full or in part
Contextual: depend on the use-case context or consumer preference
Intrinsic: independent of context
Representational: related to data design
See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
13. Accessibility Dimensions
Availability: can you access the data?
Licence: can you use the data?
Performance: can you get the data in reasonable time?
See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
14. Contextual Dimensions
Relevancy: does it cover your needs?
Trustworthiness: do you trust the publisher?
Understandability: do you understand the data? Is there documentation?
Timeliness: is the data stale or up to date?
See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
15. Intrinsic Dimensions
Syntactically valid: are there any syntax errors?
Semantically accurate: are there outliers, misused labels?
Consistent: are there inconsistencies?
Concise: are there duplicates and/or ambiguity, NULLs?
Complete: are records or values missing? (checkable with SPARQL, as sketched below)
See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
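Many intrinsic checks can be automated as plain SPARQL queries; test-driven tools such as RDFUnit compile test patterns into queries like this one. A minimal completeness-check sketch with Python/rdflib, assuming a local people.ttl file that uses the dbo: vocabulary from the earlier example:

from rdflib import Graph

# Completeness check: persons without a birth date (file name assumed).
g = Graph()
g.parse("people.ttl", format="turtle")

query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person WHERE {
    ?person a dbo:Person .
    FILTER NOT EXISTS { ?person dbo:birthDate ?date }
}
"""

for row in g.query(query):
    print("Missing birth date:", row.person)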
16. Representational Dimensions
Interoperable: are terms/labels/vocabularies reused?
Interpretable: is it self-descriptive?
Versatile: is it provided in multiple formats / languages?
See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
17. How good do you need it to be?
There is a great cost in:
> assessing the quality of a dataset
> improving the quality of a dataset
Cost is highly dependent on whether:
> data & assets are outside of your control
> data & assets are within your control
> data & assets are bought
More or less quality than you need impacts costs and/or the product
(Chart: cost ($) as a function of quality)
23. Mappings
Incorrect mapping
> errors scale with the source size (up to millions; see the sketch below)
Incomplete mapping
Software bugs
> conversion scripts, ETL code, etc.
Model sync
> port schema updates
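To see why a single mapping bug scales with source size, here is a hypothetical conversion-script sketch (the ex: namespace and record fields are made up): one wrong line taints every generated triple.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/")
g = Graph()

# Upstream records (CSV dump, API response, ...); imagine millions.
records = [{"id": "Foo", "born": "2000-01-01"}]

for rec in records:
    person = EX[rec["id"]]
    g.add((person, RDF.type, EX.Person))
    # BUG: dates are emitted as xsd:string instead of xsd:date,
    # so the error is repeated once per source record.
    g.add((person, EX.birthDate, Literal(rec["born"], datatype=XSD.string)))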
24. Validation Rules
Incorrect translation
> birthDate max cardinality 1
> vs. birthDate min cardinality 1 (see the SHACL sketch below)
Syntax errors & typos
> dirthDate must be xsd:date
Model sync
> port schema updates
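A minimal sketch of the intended rule as a SHACL shape, validated here with Python and pySHACL (shape and instance names are assumptions, reusing the dbo:/ex: prefixes from earlier slides):

from pyshacl import validate
from rdflib import Graph

# Intended rule: at most one birth date, typed xsd:date.
shapes_ttl = """
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix ex:  <http://example.org/> .

ex:PersonShape a sh:NodeShape ;
    sh:targetClass dbo:Person ;
    sh:property [
        sh:path dbo:birthDate ;
        sh:maxCount 1 ;          # max, not min
        sh:datatype xsd:date ;
    ] .
"""

data_ttl = """
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

ex:Foo a dbo:Person ;
    dbo:birthDate "2000-01-01"^^xsd:date, "2001-01-01"^^xsd:date .
"""

conforms, _, report_text = validate(
    Graph().parse(data=data_ttl, format="turtle"),
    shacl_graph=Graph().parse(data=shapes_ttl, format="turtle"),
)
print(conforms)      # False: ex:Foo has two birth dates
print(report_text)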
27. Strategies for managing quality
Data testers
> explicit / implicit roles
Crowdsourcing
> field experts vs MTurk
Executable validation rules
> SHACL, ShEx, OWL (see the sketch below)
See Acosta et al., Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study;
Kontokostas et al., TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data (demo)
> Needs good tool support
> Generic tools missing
> Validation engines improved
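One payoff of executable rules: a SHACL engine returns its findings as an RDF validation report that can itself be queried, which helps with the tooling gaps noted above. A sketch assuming pySHACL and repo-local data.ttl / shapes.ttl files:

from pyshacl import validate
from rdflib import Graph

data = Graph().parse("data.ttl", format="turtle")
shapes = Graph().parse("shapes.ttl", format="turtle")
conforms, report, _ = validate(data, shacl_graph=shapes)

# The report is plain RDF: aggregate violations per property path.
for row in report.query("""
    PREFIX sh: <http://www.w3.org/ns/shacl#>
    SELECT ?path (COUNT(?r) AS ?n) WHERE {
        ?r a sh:ValidationResult ; sh:resultPath ?path .
    } GROUP BY ?path
"""):
    print(row.path, row.n)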
28. Validate closer to the source of the error
See Dimou et al., Assessing and Refining Mappings to RDF to Improve Dataset Quality;
Kontokostas et al., Semantically Enhanced Quality Assurance in the JURION Business Use Case
> Source data is always in the K range
> Converted data scales with source size
> Errors scale as well
31. CI/CD is your best friend
Treat data as code
> Jenkins, Travis, GitLab, TeamCity, ...
Trigger validation on every (single) change
> Fail the build until data issues are fixed
Create (data) integration tests (see the sketch below)
Just like in software…
> Green CI ≠ no errors/bugs
> Green CI ⇒ not enough tests
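A minimal sketch of such a data integration test, assuming pytest plus pySHACL and repo-local data.ttl / shapes.ttl files; any of the CI servers above will fail the build while the assertion trips:

# test_data_quality.py -- executed by CI (e.g. via pytest) on every change.
from pyshacl import validate
from rdflib import Graph

def test_data_conforms_to_shapes():
    data = Graph().parse("data.ttl", format="turtle")
    shapes = Graph().parse("shapes.ttl", format="turtle")
    conforms, _, report_text = validate(data, shacl_graph=shapes)
    # A failing assertion keeps the build red until the data is fixed.
    assert conforms, "SHACL violations found:\n" + report_text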
32. Recap
> Data quality is fitness for use
> Can be assessed with multiple dimensions
> Identify the quality you need
> Also look for errors in the schema, the rules and the mappings
> Validate closer to the error source
> Automate as much as possible