www.kit.edu
@Data Quality Tutorial, September 12, 2016
Crowdsourcing Linked Data Quality Assessment
Amrapali Zaveri
Linked Data - over a billion facts
What about the quality?
Motivation - Linked Data Quality
Varying quality of Linked Data sources
Source, Extraction, Integration etc.
Some quality issues require certain interpretation
that can be easily performed by humans
Incompleteness
Incorrectness
Semantic Accuracy
Motivation - Linked Data Quality
Solution: Include human verification in the
process of LD quality assessment via
crowdsourcing
Human Intelligence Tasks (HITs)
Labor market
Monetary Reward/Incentive
Time & Cost effective
Large-scale problem solving approach, divided into
smaller tasks, independently solved by a large
group of people.
Research questions
RQ1: Is it possible to detect quality issues in LD data sets
via crowdsourcing mechanisms?
RQ2: What type of crowd is most suitable for each type of
quality issue?
RQ3: Which types of errors are made by lay users and
experts when assessing RDF triples?
Related work
Crowdsourcing & Linked Data: ZenCrowd (entity resolution), CrowdMAP (ontology alignment), GWAP for LD, assessing LD mappings
Web of data quality assessment, by degree of automation (automatic, semi-automatic, manual): quality characteristics of LD data sources; DBpedia; WIQA, Sieve
Our work
OUR APPROACH
Find-Verify Phases of Crowdsourcing (1)
Find: Contest, LD experts, difficult task, final prize. Tool: TripleCheckMate [Kontokostas2013]
Verify: Microtasks, workers, easy task, micropayments. Tool: MTurk (http://mturk.com)
(1) Adapted from [Bernstein2010]
Difference between LD experts & Workers
               LD Experts                          AMT Workers
Type           Contest-based                       Human Intelligence Tasks (HITs)
Participants   Linked Data (LD) experts            Labor market
Task           Detect and classify quality         Detect quality issues in triples
               issues in resources
Reward         Most no. of resources evaluated     Per task/triple
Tool           TripleCheckMate                     Amazon Mechanical Turk, CrowdFlower etc.
Methodology
Crowdsource using:
• Linked Data experts - Find Phase
• Amazon Mechanical Turk workers - Verify Phase
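The two phases can be read as a filter chain: only what the experts flag in the Find phase is passed on to the workers in the Verify phase. The following is a minimal sketch of that reading, not the authors' implementation. The names `find_verify`, `find` and `verify` are hypothetical, the human judgments are stood in for by plain functions, and `dbpedia:Example` is a made-up resource.

```python
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def find_verify(
    triples: Iterable[Triple],
    find: Callable[[Triple], bool],
    verify: Callable[[Triple], bool],
) -> List[Triple]:
    """Return the triples flagged in the Find phase and confirmed in the Verify phase."""
    # Find phase: experts flag the triples they consider problematic.
    candidates = [t for t in triples if find(t)]
    # Verify phase: only the flagged triples are sent to the workers.
    return [t for t in candidates if verify(t)]


if __name__ == "__main__":
    data = [
        ("dbpedia:Dave_Dobbyn", "dbprop:dateOfBirth", "3"),
        ("dbpedia:Example", "foaf:name", "Example"),  # hypothetical triple
    ]
    # Stand-in judgments; in the study these come from people, not functions.
    flagged_by_experts = {data[0]}
    confirmed_by_workers = {data[0]}
    print(find_verify(
        data,
        find=lambda t: t in flagged_by_experts,
        verify=lambda t: t in confirmed_by_workers,
    ))
```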
Crowdsourcing using Linked Data
Experts — Methodology
Phase I: Creation of quality problem taxonomy
Phase II: Launching a contest
Zaveri et al. Quality assessment methodologies for Linked Open Data. Semantic
Web Journal, 2015.
D = Detectable, F = Fixable, E = Extraction Framework, M = Mappings Wiki
Crowdsourcing using Linked Data
Experts — Quality Problem Taxonomy
D = Detectable, F = Fixable, E = Extraction Framework, M = Mappings Wiki
http://nl.dbpedia.org:8080/TripleCheckMate-Demo/
Crowdsourcing using Linked Data
Experts — Contest
Crowdsourcing using Linked Data
Experts — Results
Methodology
Crowdsource using:
• Linked Data experts - Find Phase
• Amazon Mechanical Turk workers - Verify Phase
Crowdsourcing using AMT Workers
Steps:
1. Selecting LD quality issues to crowdsource
2. Designing and generating the micro tasks to present the data to the crowd
[Figure: triples {s p o .} from the dataset are shown to the workers, who mark each one as Correct or as Incorrect + Quality issue]
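A worker's answer to one microtask can be modelled as a small record: the triple, a correct/incorrect judgment, and a quality issue when the triple is judged incorrect. The sketch below is one possible encoding, using the three issue categories named in this deck. The class and field names are assumptions made for illustration, not part of TripleCheckMate or MTurk.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class QualityIssue(Enum):
    INCORRECT_OBJECT = "incorrect/incomplete object"
    INCORRECT_DATATYPE_OR_LANG = "incorrect data type or language tag"
    INCORRECT_LINK = "incorrect link to external Web page"


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Answer:
    triple: Triple
    correct: bool
    issue: Optional[QualityIssue] = None

    def __post_init__(self) -> None:
        # "Incorrect" answers carry a quality issue; "Correct" answers do not.
        if self.correct and self.issue is not None:
            raise ValueError("a correct triple has no quality issue")
        if not self.correct and self.issue is None:
            raise ValueError("an incorrect triple needs a quality issue")


if __name__ == "__main__":
    t = Triple("dbpedia:Dave_Dobbyn", "dbprop:dateOfBirth", "3")
    print(Answer(t, correct=False, issue=QualityIssue.INCORRECT_OBJECT))
```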
Step 1: Selecting LD quality issues to crowdsource
Three categories of quality problems occur
pervasively in DBpedia and can be crowdsourced:
Incorrect/Incomplete object
▪Example: dbpedia:Dave_Dobbyn dbprop:dateOfBirth “3”.
Incorrect data type or language tags
▪Example: dbpedia:Torishima_Izu_Islands foaf:name “ ”@en.
Incorrect link to “external Web pages”
▪Example: dbpedia:John-Two-Hawks dbpedia-owl:wikiPageExternalLink <http://cedarlakedvd.com/>
Presenting the data to the crowd
• Selection of foaf:name or
rdfs:label to extract human-
readable descriptions
• Values extracted automatically
from Wikipedia infoboxes
• Link to the Wikipedia article via
foaf:isPrimaryTopicOf
• Preview of external pages by
implementing HTML iframe
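The presentation choices above can be sketched as a small rendering function. This is an illustrative sketch only, not the interface used in the study. The function names, the preference for foaf:name over rdfs:label, and the label and Wikipedia URL in the usage example are all assumptions.

```python
from html import escape
from typing import Dict, Optional

# Properties tried, in this (assumed) order, to get a human-readable name.
LABEL_PROPERTIES = ("foaf:name", "rdfs:label")


def human_label(properties: Dict[str, str], fallback: str) -> str:
    """Pick the first available label property, else fall back to the identifier."""
    for prop in LABEL_PROPERTIES:
        value = properties.get(prop)
        if value:
            return value
    return fallback


def render_task(subject: str, properties: Dict[str, str],
                external_url: Optional[str] = None) -> str:
    """Build an HTML fragment for one microtask."""
    parts = [f"<h3>{escape(human_label(properties, fallback=subject))}</h3>"]
    wiki = properties.get("foaf:isPrimaryTopicOf")
    if wiki:
        # Link to the Wikipedia article so workers can compare values.
        parts.append(f'<a href="{escape(wiki)}">Wikipedia article</a>')
    if external_url:
        # Preview of the external page inside the task.
        parts.append(f'<iframe src="{escape(external_url)}"></iframe>')
    return "\n".join(parts)


if __name__ == "__main__":
    print(render_task(
        "dbpedia:John-Two-Hawks",
        {
            "rdfs:label": "John Two-Hawks",  # assumed label value
            "foaf:isPrimaryTopicOf": "http://en.wikipedia.org/wiki/John_Two-Hawks",  # assumed URL
        },
        external_url="http://cedarlakedvd.com/",
    ))
```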
Microtask interfaces: MTurk tasks
Incorrect object
Incorrect data type or language tag
Incorrect outlink
EXPERIMENTAL STUDY
Experimental design
• Crowdsourcing approaches:
• Find stage: Contest with LD experts
• Verify stage: Microtasks
• Creation of a gold standard:
• Two of the authors of this paper (MA, AZ) generated the gold
standard for all the triples obtained from the contest
• Each author independently evaluated the triples
• Conflicts were resolved via mutual agreement
• Metric: precision
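As a reminder of the metric, precision is the share of flagged triples that the gold standard also marks as incorrect, TP / (TP + FP). The sketch below assumes that a "positive" is a triple flagged as incorrect. The function name and the toy judgments in the usage example are hypothetical and are not data from the study.

```python
from typing import Dict, Hashable


def precision(flagged: Dict[Hashable, bool], gold: Dict[Hashable, bool]) -> float:
    """flagged[t] is True when the crowd marked triple t as incorrect; gold likewise."""
    # True positives: flagged as incorrect and incorrect in the gold standard.
    tp = sum(1 for t, f in flagged.items() if f and gold[t])
    # False positives: flagged as incorrect but correct in the gold standard.
    fp = sum(1 for t, f in flagged.items() if f and not gold[t])
    if tp + fp == 0:
        raise ValueError("no triple was flagged, so precision is undefined")
    return tp / (tp + fp)


if __name__ == "__main__":
    # Toy judgments for four triples (hypothetical values).
    crowd = {"t1": True, "t2": True, "t3": False, "t4": True}
    gold = {"t1": True, "t2": False, "t3": True, "t4": True}
    print(precision(crowd, gold))  # two true positives, one false positive
```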
Overall results
                                   LD Experts                Microtask workers
Number of distinct participants    50                        80
Total time                         3 weeks (predefined)      4 days
Total triples evaluated            1,512                     1,073
Total cost                         ~ US$ 400 (predefined)    ~ US$ 43
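Dividing the reported totals gives a rough cost per evaluated triple for each crowd. This is back-of-the-envelope arithmetic on the figures above, not a result stated in the slides. The expert cost was a predefined prize and the two crowds performed different tasks, so the numbers are not a like-for-like comparison. The function name is hypothetical.

```python
def cost_per_triple(total_cost_usd: float, triples_evaluated: int) -> float:
    """Average spend per evaluated triple, in US dollars."""
    if triples_evaluated <= 0:
        raise ValueError("triples_evaluated must be positive")
    return total_cost_usd / triples_evaluated


if __name__ == "__main__":
    experts = cost_per_triple(400, 1512)   # approximate totals from the table
    workers = cost_per_triple(43, 1073)
    print(f"LD experts:        ~US$ {experts:.2f} per triple")
    print(f"Microtask workers: ~US$ {workers:.2f} per triple")
```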
Precision results: Incorrect object task
• MTurk workers can be used to reduce the error rates of LD experts for
the Find stage
• 117 DBpedia triples had predicates related to dates with incorrect/
incomplete values:
”2005 Six Nations Championship” Date 12 .
• 52 DBpedia triples had erroneous values from the source:
”English (programming language)” Influenced by ? .
• Experts classified all these triples as incorrect
• Workers compared values against Wikipedia and successfully classified
these triples as “correct”
Triples compared    LD Experts    MTurk (majority voting: n=5)
509                 0.7151        0.8977
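The MTurk figure aggregates five answers per triple by majority voting. A minimal sketch of that aggregation step is shown below. The function name and the handling of ties (which cannot occur with two answer options and n=5, but can with more options) are assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(answers: Sequence[Hashable]) -> Hashable:
    """Return the answer given by most workers; refuse to decide on a tie."""
    if not answers:
        raise ValueError("no answers to aggregate")
    ranked = Counter(answers).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        raise ValueError("tie between the most frequent answers")
    return ranked[0][0]


if __name__ == "__main__":
    # Five hypothetical worker answers for one triple.
    votes = ["incorrect", "correct", "incorrect", "incorrect", "correct"]
    print(majority_vote(votes))
```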
Precision results: Incorrect data type task
[Figure: bar chart of the number of triples (0 to 150) per data type (Date, Millimetre, Number, Second, Year), with series Experts TP, Experts FP, Crowd TP, Crowd FP]
Triples compared    LD Experts    MTurk (majority voting: n=5)
341                 0.8270        0.4752
Precision results: Incorrect link task
• We analyzed the 189 misclassifications by the experts:
[Figure: pie chart of the experts' misclassifications by link type (Freebase links, Wikipedia images, External links), with segments of 11%, 39% and 50%]
• The misclassifications by the workers correspond to pages
in a language other than English.
Triples compared    Baseline    LD Experts    MTurk (majority voting: n=5)
223                 0.2598      0.1525        0.9412
Final discussion
RQ1: Is it possible to detect quality issues in LD data sets via crowdsourcing
mechanisms?
LD experts - incorrect datatype
AMT workers - incorrect/incomplete object value, incorrect
interlink
RQ2: What type of crowd is most suitable for each type of quality issue?
The effort of LD experts must be applied to those tasks demanding
domain-specific skills. AMT workers were exceptionally good at
performing data comparisons.
RQ3: Which types of errors are made by lay users and experts?
Lay users do not have the skills to solve domain-specific tasks, while
experts' performance is very low on tasks that demand extra effort
(e.g., checking an external page).
CONCLUSIONS, CHALLENGES &
FUTURE WORK
Conclusions
A crowdsourcing methodology for LD quality assessment:
Find stage: LD experts
Verify stage: AMT workers
Methodology and tool are generic enough to be applied to other
scenarios
Crowdsourcing approaches are feasible in detecting the
studied quality issues
Challenges
Lack of gold-standard
Crowdsourcing design — how many workers? how
many tasks? reward?
Microtask design
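The design questions above (how many workers, how many tasks, what reward) interact through a simple budget calculation. The sketch below is a hypothetical planning aid, not something taken from the study. All parameter names and the numbers in the usage example are made up, and platform fees are ignored.

```python
import math


def estimate_cost_cents(n_triples: int, triples_per_task: int,
                        workers_per_task: int, reward_cents: int) -> int:
    """Total reward budget in cents for one crowdsourcing run (fees excluded)."""
    if min(n_triples, triples_per_task, workers_per_task, reward_cents) <= 0:
        raise ValueError("all parameters must be positive")
    # Each task bundles several triples; the last task may be partly filled.
    n_tasks = math.ceil(n_triples / triples_per_task)
    # Every task is answered by several workers, each paid the same reward.
    return n_tasks * workers_per_task * reward_cents


if __name__ == "__main__":
    # Hypothetical setup: 1000 triples, 5 per task, 5 workers, 4 cents each.
    cents = estimate_cost_cents(1000, 5, 5, 4)
    print(f"US$ {cents / 100:.2f}")
```

Working in integer cents avoids floating-point rounding when rewards are small.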
Future Work
Combining semi-automated and crowdsourcing methods
Predicted vs. crowdsourced metadata
Conducting new experiments (other domains)
Entity, dataset, experimental metadata
Fix/Improve Quality using Crowdsourcing
Find-Fix-Verify Phases
References
Triplecheckmate: A tool for crowdsourcing the quality assessment of
linked data. D Kontokostas, A Zaveri, S Auer, J Lehmann, ISWC 2013.
Crowdsourcing linked data quality assessment. M Acosta, A Zaveri, E
Simperl, D Kontokostas, S Auer, J Lehmann, ISWC 2013.
User-driven quality evaluation of DBpedia. A Zaveri, D Kontokostas, MA
Sherif, L Bühmann, M Morsey, S Auer, J Lehmann, ISEMANTiCS 2013.
Quality assessment for linked data: A survey. A Zaveri, A Rula, A
Maurino, R Pietrobon, J Lehmann, S Auer, Semantic Web Journal 2015.
Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia
Study. M Acosta, A Zaveri, E Simperl, D Kontokostas, F Flöck, J
Lehmann, Semantic Web Journal 2016.
ACRyLIQ: Leveraging DBpedia for Adaptive Crowdsourcing in Linked
Data Quality Assessment. U Hassan, A Zaveri, E Marx, E Curry, J
Lehmann. EKAW 2016.
Thank You
Questions?
amrapali@stanford.edu

@AmrapaliZ

More Related Content

What's hot

A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...Nandana Mihindukulasooriya
 
ISWC 2014 Tutorial - Instance Matching Benchmarks for Linked Data
ISWC 2014 Tutorial - Instance Matching Benchmarks for Linked DataISWC 2014 Tutorial - Instance Matching Benchmarks for Linked Data
ISWC 2014 Tutorial - Instance Matching Benchmarks for Linked DataEvangelia Daskalaki
 
Haystack keynote 2019: What is Search Relevance? - Max Irwin
Haystack keynote 2019: What is Search Relevance? - Max IrwinHaystack keynote 2019: What is Search Relevance? - Max Irwin
Haystack keynote 2019: What is Search Relevance? - Max IrwinOpenSource Connections
 
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...Anastasija Nikiforova
 
Access Lab 2020: Context aware unified institutional knowledge services
Access Lab 2020: Context aware unified institutional knowledge servicesAccess Lab 2020: Context aware unified institutional knowledge services
Access Lab 2020: Context aware unified institutional knowledge servicesOpenAthens
 
Recruiting Study Participants Online using Amazon's Mechanical Turk
Recruiting Study Participants Online using Amazon's Mechanical TurkRecruiting Study Participants Online using Amazon's Mechanical Turk
Recruiting Study Participants Online using Amazon's Mechanical TurkSC CTSI at USC and CHLA
 
Research Data Sharing and Re-Use: Practical Implications for Data Citation Pr...
Research Data Sharing and Re-Use: Practical Implications for Data Citation Pr...Research Data Sharing and Re-Use: Practical Implications for Data Citation Pr...
Research Data Sharing and Re-Use: Practical Implications for Data Citation Pr...SC CTSI at USC and CHLA
 
AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...
AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...
AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...Anastasija Nikiforova
 
Metadata Quality Assurance
Metadata Quality AssuranceMetadata Quality Assurance
Metadata Quality AssurancePéter Király
 
ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodKarry Lu
 
Using the Micropublications ontology and the Open Annotation Data Model to re...
Using the Micropublications ontology and the Open Annotation Data Model to re...Using the Micropublications ontology and the Open Annotation Data Model to re...
Using the Micropublications ontology and the Open Annotation Data Model to re...jodischneider
 
SocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meetingSocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meetingKent Anderson
 
Data Discovery and Visualization
Data Discovery and VisualizationData Discovery and Visualization
Data Discovery and VisualizationDr. Neil Brittliff
 
Metadata Quality Assurance Part II. The implementation begins
Metadata Quality Assurance Part II. The implementation beginsMetadata Quality Assurance Part II. The implementation begins
Metadata Quality Assurance Part II. The implementation beginsPéter Király
 
Lec 1 integrating data science and data analytics in various research thrust
Lec 1 integrating data science and data analytics in various research thrustLec 1 integrating data science and data analytics in various research thrust
Lec 1 integrating data science and data analytics in various research thrustMenchita Falcutila Dumlao
 
Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...
Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...
Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...Kathmandu Living Labs
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerOpenSource Connections
 
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...GUANGYUAN PIAO
 

What's hot (20)

A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
 
ISWC 2014 Tutorial - Instance Matching Benchmarks for Linked Data
ISWC 2014 Tutorial - Instance Matching Benchmarks for Linked DataISWC 2014 Tutorial - Instance Matching Benchmarks for Linked Data
ISWC 2014 Tutorial - Instance Matching Benchmarks for Linked Data
 
Haystack keynote 2019: What is Search Relevance? - Max Irwin
Haystack keynote 2019: What is Search Relevance? - Max IrwinHaystack keynote 2019: What is Search Relevance? - Max Irwin
Haystack keynote 2019: What is Search Relevance? - Max Irwin
 
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
 
Access Lab 2020: Context aware unified institutional knowledge services
Access Lab 2020: Context aware unified institutional knowledge servicesAccess Lab 2020: Context aware unified institutional knowledge services
Access Lab 2020: Context aware unified institutional knowledge services
 
Recruiting Study Participants Online using Amazon's Mechanical Turk
Recruiting Study Participants Online using Amazon's Mechanical TurkRecruiting Study Participants Online using Amazon's Mechanical Turk
Recruiting Study Participants Online using Amazon's Mechanical Turk
 
Research Data Sharing and Re-Use: Practical Implications for Data Citation Pr...
Research Data Sharing and Re-Use: Practical Implications for Data Citation Pr...Research Data Sharing and Re-Use: Practical Implications for Data Citation Pr...
Research Data Sharing and Re-Use: Practical Implications for Data Citation Pr...
 
AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...
AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...
AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...
 
Metadata Quality Assurance
Metadata Quality AssuranceMetadata Quality Assurance
Metadata Quality Assurance
 
Sanderson Shout It Out: LOUD
Sanderson Shout It Out: LOUDSanderson Shout It Out: LOUD
Sanderson Shout It Out: LOUD
 
ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For Good
 
Using the Micropublications ontology and the Open Annotation Data Model to re...
Using the Micropublications ontology and the Open Annotation Data Model to re...Using the Micropublications ontology and the Open Annotation Data Model to re...
Using the Micropublications ontology and the Open Annotation Data Model to re...
 
SocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meetingSocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meeting
 
Data Discovery and Visualization
Data Discovery and VisualizationData Discovery and Visualization
Data Discovery and Visualization
 
IFITT PhD Seminar 2015. Text Mining Ideas & Examples
IFITT PhD Seminar 2015. Text Mining Ideas & ExamplesIFITT PhD Seminar 2015. Text Mining Ideas & Examples
IFITT PhD Seminar 2015. Text Mining Ideas & Examples
 
Metadata Quality Assurance Part II. The implementation begins
Metadata Quality Assurance Part II. The implementation beginsMetadata Quality Assurance Part II. The implementation begins
Metadata Quality Assurance Part II. The implementation begins
 
Lec 1 integrating data science and data analytics in various research thrust
Lec 1 integrating data science and data analytics in various research thrustLec 1 integrating data science and data analytics in various research thrust
Lec 1 integrating data science and data analytics in various research thrust
 
Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...
Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...
Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
 

Viewers also liked

Managing Completeness of Web Data
Managing Completeness of Web DataManaging Completeness of Web Data
Managing Completeness of Web DataFariz Darari
 
Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...
Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...
Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...JohannWanja
 
A few metrics about Open Data in the cultural sector
A few metrics about Open Data in the cultural sectorA few metrics about Open Data in the cultural sector
A few metrics about Open Data in the cultural sectorJoris Pekel
 
Publishing Linked Open Data in 15 minutes
Publishing Linked Open Data in 15 minutesPublishing Linked Open Data in 15 minutes
Publishing Linked Open Data in 15 minutesAlvaro Graves
 
Asug Gov Sig Data Quality Metrics Report Sapphire 2008
Asug Gov Sig Data Quality Metrics Report Sapphire 2008Asug Gov Sig Data Quality Metrics Report Sapphire 2008
Asug Gov Sig Data Quality Metrics Report Sapphire 2008nnorthrup
 
Enhancing educational data quality in heterogeneous learning contexts using p...
Enhancing educational data quality in heterogeneous learning contexts using p...Enhancing educational data quality in heterogeneous learning contexts using p...
Enhancing educational data quality in heterogeneous learning contexts using p...Alex Rayón Jerez
 
Linked Data Quality Assessment – daQ and Luzzu
Linked Data Quality Assessment – daQ and LuzzuLinked Data Quality Assessment – daQ and Luzzu
Linked Data Quality Assessment – daQ and Luzzujerdeb
 
Leveraging DBpedia for Adaptive Crowdsourcing in Linked Data Quality Assessment
Leveraging DBpedia for Adaptive Crowdsourcing in Linked Data Quality AssessmentLeveraging DBpedia for Adaptive Crowdsourcing in Linked Data Quality Assessment
Leveraging DBpedia for Adaptive Crowdsourcing in Linked Data Quality AssessmentUmair ul Hassan
 
Introduction to open data quality et
Introduction to open data quality etIntroduction to open data quality et
Introduction to open data quality etOpen Data Support
 
An introduction to Linked (Open) Data
An introduction to Linked (Open) DataAn introduction to Linked (Open) Data
An introduction to Linked (Open) DataAli Khalili
 
Overview of Open Data, Linked Data and Web Science
Overview of Open Data, Linked Data and Web ScienceOverview of Open Data, Linked Data and Web Science
Overview of Open Data, Linked Data and Web ScienceHaklae Kim
 
Linked Open Data Principles, Technologies and Examples
Linked Open Data Principles, Technologies and ExamplesLinked Open Data Principles, Technologies and Examples
Linked Open Data Principles, Technologies and ExamplesOpen Data Support
 
Evaluating the possibilities of DataCite for developing 'Open data metrics' o...
Evaluating the possibilities of DataCite for developing 'Open data metrics' o...Evaluating the possibilities of DataCite for developing 'Open data metrics' o...
Evaluating the possibilities of DataCite for developing 'Open data metrics' o...Nicolas Robinson-Garcia
 
Llinked open data training for EU institutions
Llinked open data training for EU institutionsLlinked open data training for EU institutions
Llinked open data training for EU institutionsOpen Data Support
 
Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...
Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...
Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...semanticsconference
 
OWL-based validation by Gavin Mendel Gleasonand Bojan Bozic, Trinity College,...
OWL-based validation by Gavin Mendel Gleasonand Bojan Bozic, Trinity College,...OWL-based validation by Gavin Mendel Gleasonand Bojan Bozic, Trinity College,...
OWL-based validation by Gavin Mendel Gleasonand Bojan Bozic, Trinity College,...semanticsconference
 
Bio2RDF : A biological knowledge base for the Semantic Web
Bio2RDF : A biological knowledge base for the Semantic WebBio2RDF : A biological knowledge base for the Semantic Web
Bio2RDF : A biological knowledge base for the Semantic WebMichel Dumontier
 
Generating Biomedical Hypotheses Using Semantic Web Technologies
Generating Biomedical Hypotheses Using Semantic Web TechnologiesGenerating Biomedical Hypotheses Using Semantic Web Technologies
Generating Biomedical Hypotheses Using Semantic Web TechnologiesMichel Dumontier
 

Viewers also liked (20)

Data Science for the Win
Data Science for the WinData Science for the Win
Data Science for the Win
 
Managing Completeness of Web Data
Managing Completeness of Web DataManaging Completeness of Web Data
Managing Completeness of Web Data
 
Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...
Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...
Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...
 
A few metrics about Open Data in the cultural sector
A few metrics about Open Data in the cultural sectorA few metrics about Open Data in the cultural sector
A few metrics about Open Data in the cultural sector
 
Publishing Linked Open Data in 15 minutes
Publishing Linked Open Data in 15 minutesPublishing Linked Open Data in 15 minutes
Publishing Linked Open Data in 15 minutes
 
Asug Gov Sig Data Quality Metrics Report Sapphire 2008
Asug Gov Sig Data Quality Metrics Report Sapphire 2008Asug Gov Sig Data Quality Metrics Report Sapphire 2008
Asug Gov Sig Data Quality Metrics Report Sapphire 2008
 
Enhancing educational data quality in heterogeneous learning contexts using p...
Enhancing educational data quality in heterogeneous learning contexts using p...Enhancing educational data quality in heterogeneous learning contexts using p...
Enhancing educational data quality in heterogeneous learning contexts using p...
 
Linked Data Quality Assessment – daQ and Luzzu
Linked Data Quality Assessment – daQ and LuzzuLinked Data Quality Assessment – daQ and Luzzu
Linked Data Quality Assessment – daQ and Luzzu
 
Leveraging DBpedia for Adaptive Crowdsourcing in Linked Data Quality Assessment
Leveraging DBpedia for Adaptive Crowdsourcing in Linked Data Quality AssessmentLeveraging DBpedia for Adaptive Crowdsourcing in Linked Data Quality Assessment
Leveraging DBpedia for Adaptive Crowdsourcing in Linked Data Quality Assessment
 
Introduction to open data quality et
Introduction to open data quality etIntroduction to open data quality et
Introduction to open data quality et
 
An introduction to Linked (Open) Data
An introduction to Linked (Open) DataAn introduction to Linked (Open) Data
An introduction to Linked (Open) Data
 
Open data quality
Open data qualityOpen data quality
Open data quality
 
Overview of Open Data, Linked Data and Web Science
Overview of Open Data, Linked Data and Web ScienceOverview of Open Data, Linked Data and Web Science
Overview of Open Data, Linked Data and Web Science
 
Linked Open Data Principles, Technologies and Examples
Linked Open Data Principles, Technologies and ExamplesLinked Open Data Principles, Technologies and Examples
Linked Open Data Principles, Technologies and Examples
 
Evaluating the possibilities of DataCite for developing 'Open data metrics' o...
Evaluating the possibilities of DataCite for developing 'Open data metrics' o...Evaluating the possibilities of DataCite for developing 'Open data metrics' o...
Evaluating the possibilities of DataCite for developing 'Open data metrics' o...
 
Llinked open data training for EU institutions
Llinked open data training for EU institutionsLlinked open data training for EU institutions
Llinked open data training for EU institutions
 
Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...
Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...
Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...
 
OWL-based validation by Gavin Mendel Gleasonand Bojan Bozic, Trinity College,...
OWL-based validation by Gavin Mendel Gleasonand Bojan Bozic, Trinity College,...OWL-based validation by Gavin Mendel Gleasonand Bojan Bozic, Trinity College,...
OWL-based validation by Gavin Mendel Gleasonand Bojan Bozic, Trinity College,...
 
Bio2RDF : A biological knowledge base for the Semantic Web
Bio2RDF : A biological knowledge base for the Semantic WebBio2RDF : A biological knowledge base for the Semantic Web
Bio2RDF : A biological knowledge base for the Semantic Web
 
Generating Biomedical Hypotheses Using Semantic Web Technologies
Generating Biomedical Hypotheses Using Semantic Web TechnologiesGenerating Biomedical Hypotheses Using Semantic Web Technologies
Generating Biomedical Hypotheses Using Semantic Web Technologies
 

Similar to Detecting Data Quality Issues via Crowdsourcing

Crowdsourcing the Semantic Web
Crowdsourcing the Semantic WebCrowdsourcing the Semantic Web
Crowdsourcing the Semantic WebElena Simperl
 
RDF Data Quality Assessment - connecting the pieces
RDF Data Quality Assessment - connecting the piecesRDF Data Quality Assessment - connecting the pieces
RDF Data Quality Assessment - connecting the piecesConnected Data World
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Trey Grainger
 
How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?andrea huang
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2Mahmoud Alfarra
 
Barga Data Science lecture 2
Barga Data Science lecture 2Barga Data Science lecture 2
Barga Data Science lecture 2Roger Barga
 
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...Hima Patel
 
Data Quality
Data QualityData Quality
Data Qualityjerdeb
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentationnirvdrum
 
ALIGNED Data Curation Methods and Tools
ALIGNED Data Curation Methods and ToolsALIGNED Data Curation Methods and Tools
ALIGNED Data Curation Methods and ToolsAlignedProject
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxsumitkumar600840
 
Best Practices in Recommender System Challenges
Best Practices in Recommender System ChallengesBest Practices in Recommender System Challenges
Best Practices in Recommender System ChallengesAlan Said
 
LinkedUp - Linked Data & Education
LinkedUp - Linked Data & EducationLinkedUp - Linked Data & Education
LinkedUp - Linked Data & EducationStefan Dietze
 
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015Ioan Toma
 
PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.Giuseppe Ricci
 
Metadata quality Assurance Framework at QQML2016 - short
Metadata quality Assurance Framework at QQML2016 - shortMetadata quality Assurance Framework at QQML2016 - short
Metadata quality Assurance Framework at QQML2016 - shortPéter Király
 
Imtiaz khan data_science_analytics
Imtiaz khan data_science_analyticsImtiaz khan data_science_analytics
Imtiaz khan data_science_analyticsimtiaz khan
 
Data Science Workshop - day 2
Data Science Workshop - day 2Data Science Workshop - day 2
Data Science Workshop - day 2Aseel Addawood
 
acmsigtalkshare-121023190142-phpapp01.pptx
acmsigtalkshare-121023190142-phpapp01.pptxacmsigtalkshare-121023190142-phpapp01.pptx
acmsigtalkshare-121023190142-phpapp01.pptxdongchangim30
 

Detecting Data Quality Issues via Crowdsourcing

  • 1. Crowdsourcing Linked Data Quality Assessment. Amrapali Zaveri. Data Quality Tutorial, September 12, 2016. www.kit.edu
  • 2. Linked Data - over a billion facts. What about the quality?
  • 3. Motivation - Linked Data Quality
      • Varying quality of Linked Data sources: source, extraction, integration, etc.
      • Some quality issues require interpretation that humans can easily perform: incompleteness, incorrectness, semantic accuracy
  • 4. Motivation - Linked Data Quality
      • Solution: include human verification in the process of LD quality assessment via crowdsourcing
      • Crowdsourcing: a large-scale problem-solving approach in which a problem is divided into smaller tasks that are solved independently by a large group of people
      • Human Intelligence Tasks (HITs)
      • Labor market
      • Monetary reward/incentive
      • Time- and cost-effective
  • 5. Research questions
      • RQ1: Is it possible to detect quality issues in LD data sets via crowdsourcing mechanisms?
      • RQ2: What type of crowd is most suitable for each type of quality issue?
      • RQ3: Which types of errors are made by lay users and experts when assessing RDF triples?
  • 6. Related work
      • Crowdsourcing & Linked Data: ZenCrowd (entity resolution), CrowdMAP (ontology alignment), GWAP for LD, Assessing LD mappings
      • Web of data quality assessment (automatic, semi-automatic, and manual approaches): Quality characteristics of LD data sources, DBpedia, WIQA, Sieve
      • Our work
  • 8. Find-Verify Phases of Crowdsourcing
      • Find: contest, LD experts, difficult task, final prize. Tool: TripleCheckMate [Kontokostas2013]
      • Verify: microtasks, workers, easy task, micropayments. Platform: MTurk (http://mturk.com)
      (1) Adapted from [Bernstein2010]
  • 9. Difference between LD experts & Workers
      Aspect | LD Experts | AMT Workers
      Type | Contest-based | Human Intelligence Tasks (HITs)
      Participants | Linked Data (LD) experts | Labor market
      Task | Detect and classify quality issues in resources | Detect quality issues in triples
      Reward | Most no. of resources evaluated | Per task/triple
      Tool | TripleCheckMate | Amazon Mechanical Turk, CrowdFlower etc.
  • 10. Methodology Crowdsource using: • Linked Data experts - Find Phase • Amazon Mechanical Turk workers - Verify Phase
  • 11. Crowdsourcing using Linked Data Experts — Methodology Phase I: Creation of quality problem taxonomy Phase II: Launching a contest
  • 12. Crowdsourcing using Linked Data Experts: Quality Problem Taxonomy. D = Detectable, F = Fixable, E = Extraction Framework, M = Mappings Wiki. Source: Zaveri et al. Quality assessment methodologies for Linked Open Data. Semantic Web Journal, 2015.
  • 13. Crowdsourcing using Linked Data Experts: Quality Problem Taxonomy. D = Detectable, F = Fixable, E = Extraction Framework, M = Mappings Wiki.
  • 15. Crowdsourcing using Linked Data Experts — Results
  • 16. Methodology Crowdsource using: • Linked Data experts - Find Phase • Amazon Mechanical Turk workers - Verify Phase
  • 17. Crowdsourcing using AMT Workers
      • Step 1: Selecting LD quality issues to crowdsource
      • Step 2: Designing and generating the microtasks to present the data to the crowd
      [Figure: Dataset, triples {s p o .}, crowd outcome: Correct, or Incorrect + Quality issue]
  • 18. Selecting LD quality issues to crowdsource (step 1)
      Three categories of quality problems occur pervasively in DBpedia and can be crowdsourced:
      • Incorrect/incomplete object. Example: dbpedia:Dave_Dobbyn dbprop:dateOfBirth “3”.
      • Incorrect data type or language tags. Example: dbpedia:Torishima_Izu_Islands foaf:name “ ”@en.
      • Incorrect link to “external Web pages”. Example: dbpedia:John-Two-Hawks dbpedia-owl:wikiPageExternalLink <http://cedarlakedvd.com/>
  • 19. Presenting the data to the crowd (step 2)
      • Selection of foaf:name or rdfs:label to extract human-readable descriptions
      • Values extracted automatically from Wikipedia infoboxes
      • Link to the Wikipedia article via foaf:isPrimaryTopicOf
      • Preview of external pages by implementing HTML iframe
      Microtask interfaces (MTurk tasks): incorrect object, incorrect data type or language tag, incorrect outlink
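The iframe preview mentioned on this slide could be generated along the following lines. This is only a sketch: the function name `render_link_task`, the HTML layout, and the Wikipedia URL in the usage example are assumptions made for illustration, not the interface used in the study.

```python
from html import escape

def render_link_task(subject_label, wikipedia_url, external_url):
    """Build an HTML fragment for an 'incorrect link' microtask (hypothetical layout)."""
    return (
        # Human-readable label linked to the Wikipedia article for context.
        f'<p>Is the page below related to <a href="{escape(wikipedia_url, quote=True)}">'
        f'{escape(subject_label)}</a>?</p>\n'
        # Preview of the external page so the worker does not leave the task.
        f'<iframe src="{escape(external_url, quote=True)}" width="600" height="400"></iframe>\n'
        '<label><input type="radio" name="answer" value="correct"> Correct</label>\n'
        '<label><input type="radio" name="answer" value="incorrect"> Incorrect</label>'
    )

# The external link is the example from the slides; the Wikipedia URL is a placeholder.
print(render_link_task("John Two-Hawks",
                       "https://en.wikipedia.org/wiki/Example",
                       "http://cedarlakedvd.com/"))
```

Escaping the label and URLs keeps values taken from the dataset from breaking the task markup.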
  • 21. Experimental design
      • Crowdsourcing approaches:
          • Find stage: Contest with LD experts
          • Verify stage: Microtasks
      • Creation of a gold standard:
          • Two of the authors of this paper (MA, AZ) generated the gold standard for all the triples obtained from the contest
          • Each author independently evaluated the triples
          • Conflicts were resolved via mutual agreement
      • Metric: precision
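The precision metric can be sketched as below, assuming the usual definition TP / (TP + FP): of the triples a crowd flags as incorrect, the fraction that the gold standard also marks as incorrect. The function name and the sample triple ids are hypothetical.

```python
def precision(flagged, gold):
    """Share of flagged triples that the gold standard also marks as incorrect."""
    if not flagged:
        return 0.0
    true_positives = sum(1 for triple_id in flagged if gold.get(triple_id, False))
    return true_positives / len(flagged)

# Hypothetical gold standard: True means the triple really has a quality issue.
gold = {"t1": True, "t2": False, "t3": True, "t4": True}
flagged_by_crowd = ["t1", "t2", "t3"]
print(precision(flagged_by_crowd, gold))
```

In this toy input, two of the three flagged triples are truly incorrect, so one flag counts as a false positive.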
  • 22. Overall results
      Measure | LD Experts | Microtask workers
      Number of distinct participants | 50 | 80
      Total time | 3 weeks (predefined) | 4 days
      Total triples evaluated | 1,512 | 1,073
      Total cost | ~ US$ 400 (predefined) | ~ US$ 43
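As a rough reading of these totals, the cost per evaluated triple can be computed as follows. The figures come from the slide and are approximate; note that the expert cost was a predefined contest prize rather than a per-triple payment, so the comparison is indicative only.

```python
# Figures from the overall-results slide; costs are approximate ("~").
runs = {
    "LD experts": {"cost_usd": 400, "triples": 1512},
    "Microtask workers": {"cost_usd": 43, "triples": 1073},
}

# Cost divided by the number of triples evaluated by each crowd.
cost_per_triple = {name: r["cost_usd"] / r["triples"] for name, r in runs.items()}

for name, cost in cost_per_triple.items():
    print(f"{name}: ~US$ {cost:.3f} per triple")
```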
  • 23. Precision results: Incorrect object task
      • MTurk workers can be used to reduce the error rates of LD experts for the Find stage
      • 117 DBpedia triples had predicates related to dates with incorrect/incomplete values: “2005 Six Nations Championship” Date 12 .
      • 52 DBpedia triples had erroneous values from the source: “English (programming language)” Influenced by ? .
      • Experts classified all these triples as incorrect
      • Workers compared values against Wikipedia and successfully classified these triples as “correct”
      Triples compared | LD Experts | MTurk (majority voting: n=5)
      509 | 0.7151 | 0.8977
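The MTurk column aggregates five worker answers per triple by majority voting. A minimal sketch of that aggregation is shown here; the function name, the answer labels, and the tie handling are assumptions, since the slides do not describe how ties or extra answer options were treated.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer, or None when there is no answer or a tie."""
    counts = Counter(answers).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the two leading answers
    return counts[0][0]

# Five hypothetical worker answers for one triple (n=5).
votes = ["correct", "correct", "incorrect", "correct", "incorrect"]
print(majority_vote(votes))
```

With two answer options and an odd number of workers a tie cannot occur; the tie branch only matters if a third option such as "I don't know" is offered.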
  • 24. Precision results: Incorrect data type task
      [Bar chart: number of triples (0 to 150) per data type (Date, Millimetre, Number, Second, Year); series: Experts TP, Experts FP, Crowd TP, Crowd FP]
      Triples compared | LD Experts | MTurk (majority voting: n=5)
      341 | 0.8270 | 0.4752
  • 25. Precision results: Incorrect link task
      • We analyzed the 189 misclassifications by the experts. [Pie chart: segments of 11%, 39%, and 50%; categories: Freebase links, Wikipedia images, External links]
      • The misclassifications by the workers correspond to pages with a language different from English.
      Triples compared | Baseline | LD Experts | MTurk (n=5 majority voting)
      223 | 0.2598 | 0.1525 | 0.9412
  • 26. Final discussion
      • RQ1: Is it possible to detect quality issues in LD data sets via crowdsourcing mechanisms?
          • LD experts: incorrect datatype
          • AMT workers: incorrect/incomplete object value, incorrect interlink
      • RQ2: What type of crowd is most suitable for each type of quality issue?
          • The effort of LD experts should be applied to tasks that demand domain-specific skills
          • AMT workers were exceptionally good at performing data comparisons
      • RQ3: Which types of errors are made by lay users and experts?
          • Lay users do not have the skills to solve domain-specific tasks, while experts' performance is very low on tasks that demand extra effort (e.g., checking an external page)
  • 28. Conclusions
      • A crowdsourcing methodology for LD quality assessment: Find stage with LD experts, Verify stage with AMT workers
      • Methodology and tool are generic enough to be applied to other scenarios
      • Crowdsourcing approaches are feasible for detecting the studied quality issues
  • 29. Challenges
      • Lack of a gold standard
      • Crowdsourcing design: how many workers? how many tasks? what reward?
      • Microtask design
  • 30. Future Work
      • Combining semi-automated and crowdsourcing methods: predicted vs. crowdsourced metadata
      • Conducting new experiments (other domains): entity, dataset, experimental metadata
      • Fix/improve quality using crowdsourcing: Find-Fix-Verify phases
  • 31. References
      • Triplecheckmate: A tool for crowdsourcing the quality assessment of linked data. D Kontokostas, A Zaveri, S Auer, J Lehmann, ISWC 2013.
      • Crowdsourcing linked data quality assessment. M Acosta, A Zaveri, E Simperl, D Kontokostas, S Auer, J Lehmann, ISWC 2013.
      • User-driven quality evaluation of DBpedia. A Zaveri, D Kontokostas, MA Sherif, L Bühmann, M Morsey, S Auer, J Lehmann, ISEMANTiCS 2013.
      • Quality assessment for linked data: A survey. A Zaveri, A Rula, A Maurino, R Pietrobon, J Lehmann, S Auer, Semantic Web Journal 2015.
      • Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study. M Acosta, A Zaveri, E Simperl, D Kontokostas, F Flöck, J Lehmann, Semantic Web Journal 2016.
      • ACRyLIQ: Leveraging DBpedia for Adaptive Crowdsourcing in Linked Data Quality Assessment. U Hassan, A Zaveri, E Marx, E Curry, J Lehmann, EKAW 2016.