Data excellence: Better data for better AI
1. Data Excellence:
Better Data for Better AI
ODSC 2020
Lora Aroyo
http://lora-aroyo.org
@laroyo
By Scanned from The Magic of M. C. Escher. (Harry N. Abrams, Inc. ISBN
0-8109-6720-0) by Justin Foote (talk)., Fair use,
https://en.wikipedia.org/w/index.php?curid=3955850
2. TAKE HOME MESSAGE
data lifecycle - just like in software - is needed to
guide data research & development practices
data is the compass for AI - AI advances where
there is data
data is at the center - AI systems' success
depends on the quality of their data
https://en.wikipedia.org/wiki/Metamorphosis_II
data quality must be addressed in AI practices
- multitude of notions of truth
- necessity for data quality standards
data lifecycle is the backbone for data
excellence tools and practices to stay ahead of
future unintended AI behaviours
4. The Rise of the Machines
“AI Winter” → “AI Breakthroughs in Games”
IBM Watson Jeopardy
DeepMind AlphaGo
beat the humans
5. The Rise of the Machines
“AI Winter” → “AI Breakthroughs in Games” → “Real World Tasks”
Health diagnostics
Flu prediction
Weather prediction
Text, Image and Video classification
Text Generation
Text Translation
Conversational AI
support the humans
6. Mainstream Deployment of AI
“Real World Tasks” deployed in the wild → Unintended behaviors
Microsoft Tay bot
IBM Watson Oncology
Amazon Rekognition
Google Photos
Apple Face ID
Facebook chat bots
Various Speech Assistants
7. Data is the compass for AI
getting computers to “see”
the diversity of data
data quality is essential for
guiding AI away from
unintended behaviours
8. The Life of AI Data
“It exists!”
bootstrapping AI with data
Caltech101
LabelMe
Berkeley 3-D
https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
9. The Life of AI Data
“It exists!” → “It is bigger!”
data hungry AI
ImageNet
SIFT10M
OpenImages
COCO
Web 1T 5-Gram
https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
11. The Life of AI Data
“It exists!” → “It is bigger!” → “It is better!”
but before it got better ...
it got worse ...
13. The Life of AI Data
“It exists!” → “It is bigger!” → “It is better!”
but before it got better ...
reactive
data improvement
14. The Life of AI Data
“It exists!” → “It is bigger!” → “It is better!”
to reach here
we need proactive
data improvement
15. The Life of AI Data
Alon Halevy, Peter Norvig, and Fernando Pereira. 2009. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems 24, 2 (2009)
In the decade since then, the research community has done a lot
with quantity, but quality has been left behind
16. In the 1990s we introduced standards
to achieve software reliability
introduced software engineering lifecycle
- requirements, design and testing
established processes for software maintenance
- version control, sharing, documenting
established software quality metrics & processes
Ben Hutchinson, 2020
17. Now we need the same for Data
introduce data lifecycle
- requirements, design and testing
establish processes for dataset maintenance
- version control, sharing, documenting
establish data quality metrics & processes
Ben Hutchinson, 2020
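The dataset maintenance practices listed above (version control, sharing, documenting) could be sketched roughly as follows. This is a minimal illustration, not a method from the talk: the manifest fields, the `dataset_fingerprint` and `make_manifest` helpers, and the toy records are all hypothetical names chosen for the example.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Return a SHA-256 hex digest that does not depend on record order."""
    # Serialize each record canonically (sorted keys), then sort the lines
    # so that reordering the dataset does not change the fingerprint.
    lines = sorted(json.dumps(r, sort_keys=True) for r in records)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def make_manifest(name, version, owner, records):
    """Bundle basic documentation with a content fingerprint (hypothetical schema)."""
    return {
        "name": name,
        "version": version,
        "owner": owner,
        "num_records": len(records),
        "fingerprint": dataset_fingerprint(records),
    }


# Toy data, invented for illustration only.
records_v1 = [{"id": 1, "label": "guitar"}, {"id": 2, "label": "not_guitar"}]
records_v2 = records_v1 + [{"id": 3, "label": "guitar"}]

manifest_v1 = make_manifest("toy-guitars", "1.0.0", "data-team", records_v1)
manifest_v2 = make_manifest("toy-guitars", "1.1.0", "data-team", records_v2)

# A changed fingerprint signals that the content changed between versions.
print(manifest_v1["fingerprint"] != manifest_v2["fingerprint"])
```

Storing such a manifest next to the data gives every dataset version a named owner and a checkable identity, which is the dataset analogue of a tagged software release.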
18. Data Quality is not easy ...
data quality problems are typically not
caused by software bugs or just
by human errors
datasets are not easy to debug
data quality is typically the result of:
- how well a dataset
represents the actual task
- how the annotation is done
- whether the quality metrics
are adequate
19. Data Quality is not only human error
Do these images depict a GUITAR?
it is not easy to give a Y/N answer
for most of our AI tasks
[Images omitted: each candidate image is marked ✓ or ✘]
20. Data Quality should consider context of use
Do these images depict NEW ZEALAND?
it is not easy to give a Y/N answer
for most of our AI tasks
the answer typically depends on
the context, on the task, on the
usage, etc.
[Images omitted: each candidate image is marked ✓ or ✘]
21. Data Quality should include real world diversity
Do these images depict a WEDDING?
it is not easy to give a Y/N answer
for most of our AI tasks
the answer typically depends on
the context, on the task, on the
usage, etc.
disagreement is a signal of
diversity and should be included
in AI training
[Images omitted: each candidate image is marked ✓ or ✘]
22. Data Quality is difficult even with experts
Does the sentence express a TREATS relation between Chloroquine and Malaria?
For prevention of malaria, use only in individuals traveling to malarious
areas where CHLOROQUINE resistant P. falciparum MALARIA
has not been reported.
Rheumatoid arthritis and MALARIA have been treated
with CHLOROQUINE for decades.
Among 56 subjects reporting to a clinic with symptoms of MALARIA
53 (95%) had ordinarily effective levels of CHLOROQUINE in blood.
✓
✘
✓
24. Model of semantic interpretation
TRIANGLE OF MEANING
“Three Sides of CrowdTruth”, Human Computation Journal, v1, 2014, L. Aroyo, C. Welty
Workshop on “Subjectivity, Ambiguity and Disagreement (SAD) in Crowdsourcing”, The Web Conference 2019, https://sadworkshop.wordpress.com/
Annotator disagreement
is signal, not noise
Annotator disagreement
is indicative of
variation in human
interpretation
Annotator disagreement
is indicative of
ambiguity, vagueness,
similarity, over-generality,
& quality
25. Three sides of human interpretation
CROWDTRUTH Disagreement provides
guidance in task analysis:
● items with poor semantics
● items with salient terms
● items difficult to classify
● items that are ambiguous
● subjective annotations
● time-sensitive annotations
● difficult annotation tasks
● mis-translated annotations
● users with/without
specific knowledge
● communities of thought
● spammers
You can’t remove the corners…
“Three Sides of CrowdTruth”, Human Computation Journal, v1, 2014, L. Aroyo, C. Welty
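One simple way to use disagreement as guidance in task analysis is to score each item by how split its annotators are and review the most contested items first. The sketch below uses Shannon entropy over the label distribution as a simplified stand-in; it is not the CrowdTruth metric set, and the item names and votes are invented for illustration.

```python
import math
from collections import Counter


def label_distribution(labels):
    """Map each label to the fraction of annotators who chose it."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def disagreement_entropy(labels):
    """Shannon entropy in bits of the label distribution; 0.0 means full agreement."""
    return sum(-p * math.log2(p) for p in label_distribution(labels).values())


# Hypothetical crowd annotations: item id -> one label per annotator.
annotations = {
    "img_a": ["yes", "yes", "yes", "yes"],
    "img_b": ["yes", "yes", "no", "no"],
    "img_c": ["yes", "yes", "yes", "no"],
}

# Most contested items first; these are candidates for ambiguity review.
ranked = sorted(
    annotations,
    key=lambda item: disagreement_entropy(annotations[item]),
    reverse=True,
)
print(ranked)  # -> ['img_b', 'img_c', 'img_a']
```

A high score does not say why annotators disagree. The item may be ambiguous, the task unclear, or an annotator unreliable, which is why the slide lists all three corners as possible sources.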
27. 7 Myths about Human Annotation
One truth: knowledge acquisition typically assumes one
correct interpretation for every example
Experts rule: knowledge is captured from domain experts
One is enough: single expert’s knowledge is sufficient
Disagreement bad: when people disagree, they must not
understand the problem
Detailed explanations help: if examples cause
disagreement - adding instructions should help
Once done, forever valid: knowledge is not updated; new
data not aligned with old
All examples are created equal: triples are triples, one is
not more important than another, they are all either true or
false
… and we force the smoothness into a binary form
“Truth is a Lie: 7 Myths about Human Annotation”, AI Magazine 2014, L. Aroyo, C. Welty
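The remark about forcing smoothness into a binary form can be illustrated by comparing a majority-vote label with a soft label that keeps the share of positive votes. This is only a sketch under assumed data: the vote counts below are hypothetical, and `majority_vote` and `soft_label` are illustrative helper names.

```python
def majority_vote(votes):
    """Collapse binary votes into one hard label; ties resolve to False."""
    return sum(votes) * 2 > len(votes)


def soft_label(votes):
    """Keep the fraction of annotators who voted True."""
    return sum(votes) / len(votes)


# Hypothetical votes from 10 annotators on whether a sentence expresses a relation.
clear_positive = [True] * 9 + [False] * 1
contested = [True] * 6 + [False] * 4
clear_negative = [True] * 1 + [False] * 9

for name, votes in [
    ("clear_positive", clear_positive),
    ("contested", contested),
    ("clear_negative", clear_negative),
]:
    print(name, majority_vote(votes), soft_label(votes))
```

After majority voting, the clear and the contested example carry the same label, so a model trained on the hard labels cannot tell them apart. The soft label preserves that difference.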
28. High Quality Data
represents a phenomenon
accurately and consistently over time
and is replicable, reproducible,
and maintainable over time;
has empirical and explanatory power;
and is collected, stored, and used
responsibly.
Rigorous Evaluation of AI Systems workshop, 2019, Human Computation (HCOMP), http://eval.how/
Evaluating Evaluation for AI Systems workshop, 2020, Association for the Advancement of Artificial Intelligence (AAAI), http://eval.how/aaai-2020/
29. From Data Quality to Data Excellence
Data Quality is
- a point-estimate of goodness of data
Data Excellence is
- the set of practices and tools that result in
high quality data
30. How do we achieve Data Excellence?
Maintainability
Well documented datasets with
owners, which follow best practices
for data at any scale.
Reproducibility
Basic and critical regression tests
for datasets which support solid
conclusions for decision making.
Reliability
Datasets which are internally sound
and consistent; factors that affect
the data are addressed or disclosed.
Fidelity
Data which faithfully, accurately, and
comprehensively represents the
captured phenomenon.
Validity
Datasets which explain aspects of
the phenomena that they represent
in terms of external measures.
1st International Workshop on Data Excellence: http://eval.how/dew2020/
Utility
Data which adequately and
accurately achieves the intended
product behavior.
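The idea of basic regression tests for datasets, mentioned under Reproducibility, might look something like the sketch below. The specific checks (unique ids, a closed label set, label shares within a tolerance of a baseline), the `check_dataset` helper, and the record schema are assumptions made for illustration, not checks prescribed by the talk.

```python
from collections import Counter


def check_dataset(records, allowed_labels, baseline_share, tolerance=0.10):
    """Return a list of problems found; an empty list means all checks pass."""
    problems = []

    # Check 1: record ids must be unique.
    ids = [r["id"] for r in records]
    if len(ids) != len(set(ids)):
        problems.append("duplicate ids")

    # Check 2: every label must come from the agreed label set.
    unknown = {r["label"] for r in records} - set(allowed_labels)
    if unknown:
        problems.append(f"unknown labels: {sorted(unknown)}")

    # Check 3: label shares should stay close to the previous version's shares.
    counts = Counter(r["label"] for r in records)
    for label, expected in baseline_share.items():
        share = counts[label] / len(records) if records else 0.0
        if abs(share - expected) > tolerance:
            problems.append(
                f"label share drift for {label!r}: {share:.2f} vs {expected:.2f}"
            )
    return problems


# Toy datasets, invented for illustration.
good = [
    {"id": 1, "label": "yes"},
    {"id": 2, "label": "no"},
    {"id": 3, "label": "yes"},
    {"id": 4, "label": "no"},
]
bad = [
    {"id": 1, "label": "yes"},
    {"id": 1, "label": "maybe"},
    {"id": 2, "label": "yes"},
    {"id": 3, "label": "yes"},
]
baseline = {"yes": 0.5, "no": 0.5}

print(check_dataset(good, ["yes", "no"], baseline))
print(check_dataset(bad, ["yes", "no"], baseline))
```

Running such checks on every new dataset version, the way unit tests run on every code change, is one way to catch silent shifts before they reach a model.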
31. much like in software lifecycles, cutting corners at each stage
cascades to subsequent versions, which leads to technical debt
Dataset [Requirements] Analysis
Requirements Analysis
Stakeholder Input
Privacy, compliance
Trust & safety planning
Dataset Maintenance
Updating data over time
Extending to other languages
Version control
Storage and accessibility
Dataset Design
Data acquisition methodology
Rater guidelines
Construct validation
Dataset Testing
Representation metrics
Fairness metrics
Reliability metrics
Approval process
Dataset Implementation
Human labeled data
Logging interaction data
Data Lifecycle
Ben Hutchinson, 2020
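The slide lists reliability metrics under Dataset Testing without naming one. As an illustration only, the sketch below computes Cohen's kappa, a common chance-corrected agreement score for two raters; choosing kappa here is an assumption, and the rater labels are invented.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if each rater labeled independently at their own rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two raters on the same four items.
rater_1 = ["yes", "yes", "no", "no"]
rater_2 = ["yes", "no", "no", "no"]
print(cohens_kappa(rater_1, rater_2))  # -> 0.5
```

A low kappa is not automatically a rater problem. In line with the talk's argument, it can also point to ambiguous items or unclear guidelines, so it works best as a trigger for inspection and not as a pass/fail gate on its own.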
32. TAKE HOME MESSAGE
https://en.wikipedia.org/wiki/Metamorphosis_II
data lifecycle - just like in software - is needed to
guide data research & development practices
data is the compass for AI - AI advances where
there is data
data is at the center - AI systems' success
depends on the quality of their data
data quality must be addressed in AI practices
- multitude of notions of truth
- necessity for data quality standards
data lifecycle is the backbone for data
excellence tools and practices to stay ahead of
future unintended AI behaviours
33. Collaborators
EthicalAI
Ben Hutchinson
Crowd Platform
Amol Wankhede
Anurag Batra
People + AI Research (PAIR)
Nithya Sambasivan
Kristen Olson
Shivani Kapania
Jess Holbrook
Andrew Zaldivar
Mahima Pushkarna
Maysam Moussalem
Praveen Paritosh
Ka Wong
Lora Aroyo
Devi Krishna
Likert team
35. high-profile data failures
not bugs in the software, not human mistakes
problems caused by the quality of the data
just like software quality in the 1990s, the same has to happen with data
examples of questionable data
CrowdTruth relation extraction
how would you annotate it?
how do we know and measure the quality of the data?
how well does it represent the actual task we are trying to solve?
like software, we need to establish data quality standards