[Video recording available at https://www.youtube.com/playlist?list=PLewjn-vrZ7d3x0M4Uu_57oaJPRXkiS221]
Artificial Intelligence is increasingly playing an integral role in determining our day-to-day experiences. With the proliferation of AI-based solutions in areas such as hiring, lending, criminal justice, healthcare, and education, the personal and professional implications of AI are far-reaching. The dominant role played by AI models in these domains has led to growing concern about potential bias in these models, and a demand for model transparency and interpretability. In addition, model explainability is a prerequisite for building trust in and adoption of AI systems in high-stakes domains requiring reliability and safety, such as healthcare and automated transportation, and in critical industrial applications with significant economic implications, such as predictive maintenance, exploration of natural resources, and climate change modeling.
As a consequence, AI researchers and practitioners have focused their attention on explainable AI to help them better trust and understand models at scale. The challenges for the research community include (i) defining model explainability, (ii) formulating explainability tasks for understanding model behavior and developing solutions for these tasks, and finally (iii) designing measures for evaluating the performance of models in explainability tasks.
In this tutorial, we present an overview of model interpretability and explainability in AI, key regulations / laws, and techniques / tools for providing explainability as part of AI/ML systems. Then, we focus on the application of explainability techniques in industry, wherein we present practical challenges / guidelines for effectively using explainability techniques and lessons learned from deploying explainable models for several web-scale machine learning and data mining applications. We present case studies across different companies, spanning application domains such as search & recommendation systems, hiring, sales, and lending. Finally, based on our experiences in industry, we identify open problems and research directions for the data mining / machine learning community.
1. Explainable AI in Industry
KDD 2019 Tutorial
Sahin Cem Geyik, Krishnaram Kenthapadi & Varun Mithal
Krishna Gade & Ankur Taly
https://sites.google.com/view/kdd19-explainable-ai-tutorial1
2. Agenda
● Motivation
● AI Explainability: Foundations and Techniques
○ Explainability concepts, problem formulations, and evaluation methods
○ Post hoc Explainability
○ Intrinsically Explainable models
● AI Explainability: Industrial Practice
○ Case Studies from LinkedIn, Fiddler Labs, and Google Research
● Demo
● Key Takeaways and Challenges
2
4. Third Wave of AI
Factors driving the rapid advancement of AI: GPUs and on-chip neural networks, data availability, cloud infrastructure, and new algorithms
● Symbolic AI: Logic rules represent knowledge; no learning capability and poor handling of uncertainty
● Statistical AI: Statistical models for specific domains, trained on big data; no contextual capability and minimal explainability
● Explainable AI: Systems construct explanatory models; systems learn and reason with new tasks and situations
5. Need for Explainable AI
● AI today is machine learning centric, deployed across domains such as marketing, security, logistics, and finance
● ML models are opaque, non-intuitive, and difficult to understand
Questions users ask of current AI systems:
● Why did you do that?
● Why not something else?
● When do you succeed or fail?
● How do I correct an error?
● When do I trust you?
Explainable AI and ML is essential for future customers to understand, trust, and effectively manage the emerging generation of AI applications
7. Black-box AI creates confusion and doubt
Each stakeholder confronts a black-box AI with different questions:
● Business Owner: Can I trust our AI decisions?
● Internal Audit, Regulators: Are these AI system decisions fair?
● Customer Support: How do I answer this customer complaint?
● IT & Operations: How do I monitor and debug this model?
● Data Scientists: Is this the best model that can be built?
● User (receiving a poor decision): Why am I getting this decision? How can I get a better decision?
8. What is Explainable AI?
Confusion with today's black-box AI (data → black-box AI → AI product → decision, recommendation):
● Why did you do that?
● Why did you not do that?
● When do you succeed or fail?
● How do I correct an error?
Explainable AI gives clear and transparent predictions (data → explainable AI → explainable AI product → decision, explanation, feedback):
● I understand why
● I understand why not
● I know why you succeed or fail
● I understand, so I trust you
9. Why Explainability: Verify the ML Model / System
9
Credit: Samek, Binder, Tutorial on Interpretable ML, MICCAI’18
12. Why Explainability: Learn Insights in the Sciences
12
Credit: Samek, Binder, Tutorial on Interpretable ML, MICCAI’18
13. Why Explainability: Debug (Mis-)Predictions
13
Top label: “clog”
Why did the network label
this image as “clog”?
14. Why Explainability: Laws against Discrimination
● Citizenship: Immigration Reform and Control Act
● Disability status: Rehabilitation Act of 1973; Americans with Disabilities Act of 1990
● Race: Civil Rights Act of 1964
● Age: Age Discrimination in Employment Act of 1967
● Sex: Equal Pay Act of 1963; Civil Rights Act of 1964
And more...
14
16. GDPR Concerns Around Lack of Explainability in AI
"Companies should commit to ensuring systems that could fall under GDPR, including AI, will be compliant. The threat of sizeable fines of €20 million or 4% of global turnover provides a sharp incentive. Article 22 of GDPR empowers individuals with the right to demand an explanation of how an AI system made a decision that affects them."
- VP, European Commission
19. SR 11-7 and OCC regulations for Financial Institutions
19
20. "Explainability by Design" for AI products
Explainability touches the full model lifecycle: Train → QA → Predict → Deploy → A/B Test → Monitor → Debug, with a feedback loop. Capabilities include:
● Model diagnostics and root cause analytics
● Performance monitoring and fairness monitoring
● Model comparison and cohort analysis
● Explainable decisions with API support
● Model launch signoff, release management, and model evaluation
● Compliance testing, model debugging, and model visualization
23. Achieving Explainable AI
Approach 1: Post-hoc explain a given AI model
● Individual prediction explanations in terms of input features, influential examples, concepts, or local decision rules
● Global explanations of the entire model in terms of partial dependence plots, global feature importance, or global decision rules
Approach 2: Build an interpretable model
● Logistic regression, decision trees, decision lists and sets, Generalized Additive Models (GAMs)
23
27. Credit Lending in a black-box ML world
● Scenario: A customer requests a credit line increase; the bank's credit lending model scores the request (Credit Lending Score = 0.3) and the request is denied
● The customer queries the AI system: Why? Why not? How?
● Fair lending laws [ECOA, FCRA] require credit decisions to be explainable
28. The Attribution Problem
Attribute a model's prediction on an input to features of the input. Examples:
● Attribute an object recognition network's prediction to its pixels
● Attribute a text sentiment network's prediction to individual words
● Attribute a lending model's prediction to its features
A reductive formulation of "why this prediction" but surprisingly useful :-)
29. Application of Attributions
● Debugging model predictions
E.g., Attributing an image misclassification to the pixels responsible for it
● Generating an explanation for the end-user
E.g., Expose attributions for a lending prediction to the end-user
● Analyzing model robustness
E.g., Craft adversarial examples using weaknesses surfaced by attributions
● Extracting rules from the model
E.g., Combine attributions to craft rules (pharmacophores) capturing the prediction logic of a drug screening network
29
30. Next few slides
We will cover the following attribution methods**
● Ablations
● Gradient based methods
● Score Backpropagation based methods
● Shapley Value based methods
**Not a complete list!
See Ancona et al. [ICML 2019], Guidotti et al. [arxiv 2018] for a comprehensive survey
30
31. Ablations
Drop each feature and attribute the change in prediction to that feature
Useful tool but not a perfect attribution method. Why?
● Unrealistic inputs
● Improper accounting of interactive features
● Computationally expensive
31
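As an illustration, the ablation scheme above can be sketched in a few lines. This is a toy sketch: real models are not lambdas, and replacing a feature with zero is only one (often unrealistic) way to model "dropping" it, which is exactly the caveat the slide raises.

```python
import numpy as np

def ablation_attributions(model, x, baseline_value=0.0):
    """Attribute a prediction by 'dropping' each feature in turn.

    Dropping here means overwriting the feature with a baseline value;
    zeroing is an assumption, not a universally valid notion of absence.
    """
    base_pred = model(x)
    attributions = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        x_ablated = x.copy()
        x_ablated[i] = baseline_value
        # Change in prediction when feature i is removed
        attributions[i] = base_pred - model(x_ablated)
    return attributions

# Toy linear "model": prediction = 2*x0 + 3*x1
model = lambda x: 2 * x[0] + 3 * x[1]
print(ablation_attributions(model, np.array([1.0, 1.0])))  # [2. 3.]
```

Note the cost: one forward pass per feature, which is the "computationally expensive" point above, and ablating one feature at a time ignores feature interactions.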
32. Feature*Gradient
Attribution to a feature is feature value times gradient, i.e., x_i · ∂y/∂x_i
● Gradient captures sensitivity of output w.r.t. feature
● Equivalent to Feature*Coefficient for linear models
○ First-order Taylor approximation of non-linear models
● Popularized by Saliency Maps [NIPS 2013], Baehrens et al. [JMLR 2010]
Caveat: gradients in the vicinity of the input can seem like noise
32
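A minimal sketch of Feature*Gradient, using central finite differences in place of a framework's autodiff:

```python
import numpy as np

def grad_times_input(f, x, eps=1e-6):
    """Gradient*Input attribution: x_i * dF/dx_i per feature,
    with the gradient estimated by central finite differences."""
    attributions = np.zeros_like(x)
    for i in range(len(x)):
        x_hi, x_lo = x.copy(), x.copy()
        x_hi[i] += eps
        x_lo[i] -= eps
        grad_i = (f(x_hi) - f(x_lo)) / (2 * eps)
        attributions[i] = x[i] * grad_i
    return attributions

# For a linear model, Gradient*Input equals Coefficient*Feature.
f = lambda x: 4 * x[0] - 2 * x[1]
print(grad_times_input(f, np.array([0.5, 1.0])))  # ~[ 2. -2.]
```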
34. Integrated Gradients [ICML 2017]
Integrate the gradients along a straight-line path from baseline to input:
IG(input, base) ::= (input - base) · ∫₀¹ ∇F(α·input + (1-α)·base) dα
(Figure: original image and its Integrated Gradients attribution)
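The path integral can be approximated with a Riemann sum, as in the sketch below. Numerical gradients stand in for a framework's autodiff, and `steps` trades accuracy for compute; both are assumptions of this sketch, not part of the method's definition.

```python
import numpy as np

def integrated_gradients(f, x, baseline, steps=200):
    """Approximate IG with a Riemann sum along the straight-line path
    from baseline to x, using finite-difference gradients."""
    eps = 1e-6
    grad_sum = np.zeros_like(x)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = baseline + alpha * (x - baseline)
        for i in range(len(x)):
            p_hi, p_lo = point.copy(), point.copy()
            p_hi[i] += eps
            p_lo[i] -= eps
            grad_sum[i] += (f(p_hi) - f(p_lo)) / (2 * eps)
    return (x - baseline) * grad_sum / steps

f = lambda x: x[0] ** 2 + 3 * x[1]
x, base = np.array([1.0, 1.0]), np.zeros(2)
attr = integrated_gradients(f, x, base)
# Completeness: attributions should sum to f(x) - f(baseline)
print(attr, attr.sum(), f(x) - f(base))
```

The completeness check at the end is the property stated on the next slide: IG explains F(input) - F(baseline) in terms of the input features.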
35. What is a baseline?
● Ideally, the baseline is an informationless input for the model
○ E.g., Black image for image models
○ E.g., Empty text or zero embedding vector for text models
● Integrated Gradients explains F(input) - F(baseline) in terms of input features
Aside: Baselines (or Norms) are essential to explanations [Kahneman-Miller 86]
● E.g., A man suffers from indigestion. The doctor blames it on a stomach ulcer; his wife blames it on eating turnips. Both are correct relative to their baselines.
● The baseline may also be an important analysis knob.
38. Detecting an architecture bug
● Deep network [Kearns, 2016] predicts if a molecule binds to certain DNA site
● Finding: Some atoms had identical attributions despite different connectivity
39. ● Deep network [Kearns, 2016] predicts if a molecule binds to certain DNA site
● Finding: Some atoms had identical attributions despite different connectivity
Detecting an architecture bug
● Bug: The architecture had a bug due to which the convolved bond features
did not affect the prediction!
40. ● Deep network predicts various diseases from chest x-rays
Original image
Integrated gradients
(for top label)
Detecting a data issue
41. ● Deep network predicts various diseases from chest x-rays
● Finding: Attributions fell on radiologist’s markings (rather than the pathology)
Original image
Integrated gradients
(for top label)
Detecting a data issue
42. Score Back-Propagation based Methods
Re-distribute the prediction score through the neurons in the network
● LRP [JMLR 2017], DeepLift [ICML 2017], Guided BackProp [ICLR 2014]
Easy case: Output of a neuron is a linear function of previous neurons (i.e., n_i = Σ_j w_ij · n_j), e.g., the logit neuron
● Re-distribute the contribution in proportion to the coefficients w_ij
42
Image credit heatmapping.org
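For the linear easy case, the redistribution rule can be sketched directly. One common reading (the LRP-style rule) gives each previous neuron a share proportional to its term w_j · n_j in the weighted sum; this sketch assumes a non-zero total.

```python
def redistribute_linear(weights, activations, score):
    """Split an output score across input neurons in proportion to
    each neuron's term w_j * n_j in the linear combination.
    Assumes the terms do not sum to zero."""
    terms = [w * n for w, n in zip(weights, activations)]
    total = sum(terms)
    return [score * t / total for t in terms]

# A logit of 6.0 produced by 2*1 + 1*4 -> shares proportional to 2 and 4
print(redistribute_linear([2.0, 1.0], [1.0, 4.0], score=6.0))  # [2.0, 4.0]
```

Applying this layer by layer, from the output back to the inputs, is the core of the score back-propagation family.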
43. Score Back-Propagation based Methods
Re-distribute the prediction score through the neurons in the network
● LRP [JMLR 2017], DeepLift [ICML 2017], Guided BackProp [ICLR 2014]
Tricky case: Output of a neuron is a non-linear
function, e.g., ReLU, Sigmoid, etc.
● Guided BackProp: Only consider ReLUs that
are on (linear regime), and which contribute
positively
● LRP: Use first-order Taylor decomposition to
linearize activation function
● DeepLift: Distribute the activation difference relative to a reference point in proportion to edge weights
43
Image credit heatmapping.org
44. Score Back-Propagation based Methods
Re-distribute the prediction score through the neurons in the network
● LRP [JMLR 2017], DeepLift [ICML 2017], Guided BackProp [ICLR 2014]
Pros:
● Conceptually simple
● Methods have been empirically validated to yield sensible results
Cons:
● Hard to implement, requires instrumenting
the model
● Often breaks implementation invariance
Think: F(x, y, z) = x * y * z and
G(x, y, z) = x * (y * z), two implementations of the same function that may receive different attributions
Image credit heatmapping.org
45. So far we’ve looked at differentiable models.
But, what about non-differentiable models? E.g.,
● Decision trees
● Boosted trees
● Random forests
● etc.
46. Classic result in game theory on distributing gain in a coalition game
● Coalition Games
○ Players collaborating to generate some gain (think: revenue)
○ Set function v(S) determining the gain for any subset S of players
Shapley Value [Annals of Mathematical studies,1953]
47. Classic result in game theory on distributing gain in a coalition game
● Coalition Games
○ Players collaborating to generate some gain (think: revenue)
○ Set function v(S) determining the gain for any subset S of players
● Shapley Values are a fair way to attribute the total gain to the players based on
their contributions
○ Concept: Marginal contribution of a player to a subset of other players (v(S U {i}) - v(S))
○ Shapley value for a player is a specific weighted aggregation of its marginal over all
possible subsets of other players
Shapley Value for player i = ⅀_{S ⊆ N \ {i}} w(S) · (v(S ∪ {i}) - v(S))
(where w(S) = |S|! (N - |S| - 1)! / N!)
Shapley Value [Annals of Mathematical studies,1953]
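For small games, the weighted sum over subsets can be computed exactly by enumeration, as in this sketch:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values: for each player, sum the weighted marginal
    contribution v(S ∪ {i}) - v(S) over all subsets S of other players."""
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        phi = 0.0
        for size in range(n):
            for S in combinations(others, size):
                # w(S) = |S|! (n - |S| - 1)! / n!
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi += w * (v(set(S) | {i}) - v(set(S)))
        values[i] = phi
    return values

# Toy game: the gain of 10 is realized only if both players cooperate
v = lambda S: 10.0 if S == {1, 2} else 0.0
print(shapley_values([1, 2], v))  # {1: 5.0, 2: 5.0}
```

The symmetric split of the gain here is exactly the Symmetry axiom on the next slide; the exponential subset loop is why exact computation does not scale.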
48. Shapley values are unique under four simple axioms
● Dummy: If a player never contributes to the game then it must receive zero attribution
● Efficiency: Attributions must add to the total gain
● Symmetry: Symmetric players must receive equal attribution
● Linearity: Attribution for the (weighted) sum of two games must be the same as the
(weighted) sum of the attributions for each of the games
Shapley Value Justification
49. SHAP [NeurIPS 2018], QII [S&P 2016], Strumbelj & Kononenko [JMLR 2009]
● Define a coalition game for each model input X
○ Players are the features in the input
○ Gain is the model prediction (output), i.e., gain = F(X)
● Feature attributions are the Shapley values of this game
Shapley Values for Explaining ML models
50. SHAP [NeurIPS 2018], QII [S&P 2016], Strumbelj & Kononenko [JMLR 2009]
● Define a coalition game for each model input X
○ Players are the features in the input
○ Gain is the model prediction (output), i.e., gain = F(X)
● Feature attributions are the Shapley values of this game
Challenge: Shapley Values require the gain to be defined for all subsets of players
● What is the prediction when some players (features) are absent?
i.e., what is F(x_1, <absent>, x_3, …, <absent>)?
Shapley Values for Explaining ML models
51. Key Idea: Take the expected prediction when the (absent) feature is sampled
from a certain distribution.
Different approaches choose different distributions
● [SHAP, NIPS 2018] Use conditional distribution w.r.t. the present features
● [QII, S&P 2016] Use marginal distribution
● [Strumbelj et al., JMLR 2009] Use uniform distribution
● [Integrated Gradients, ICML 2017] Use a specific baseline point
Modeling Feature Absence
52. Exact Shapley value computation is exponential in the number of features
● Shapley values can be expressed as an expectation of marginals
𝜙(i) = ES ~ D [marginal(S, i)]
● Sampling-based methods can be used to approximate the expectation
● See: “Computational Aspects of Cooperative Game Theory”, Chalkiadakis et al. 2011
● Even with sampling, computation remains infeasible for models with hundreds of
features, e.g., image models
Computing Shapley Values
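The expectation-of-marginals form above suggests a simple Monte Carlo sketch: average each player's marginal contribution over random orderings of the players (each ordering induces the subset S of players preceding i). This is one standard sampling scheme, not the only one.

```python
import random

def shapley_monte_carlo(players, v, n_samples=2000, seed=0):
    """Approximate Shapley values by averaging each player's marginal
    contribution v(S ∪ {p}) - v(S) over random player orderings."""
    rng = random.Random(seed)
    est = {p: 0.0 for p in players}
    for _ in range(n_samples):
        order = players[:]
        rng.shuffle(order)
        S = set()
        for p in order:
            est[p] += v(S | {p}) - v(S)
            S.add(p)
    return {p: e / n_samples for p, e in est.items()}

# Additive game: each player contributes its own value, so the Shapley
# value of player p is exactly p, whatever the ordering.
v = lambda S: sum(S)
print(shapley_monte_carlo([1, 2, 3], v))  # {1: 1.0, 2: 2.0, 3: 3.0}
```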
54. Human Review
Have humans review attributions and/or compare them to (human-provided)
ground truth on "feature importance"
Pros:
● Helps assess if attributions are human-intelligible
● Helps increase trust in the attribution method
Cons:
● Attributions may appear incorrect because model reasons differently
● Confirmation bias
55. Perturbations (Samek et al., IEEE NN and LS 2017)
Perturb top-k features by attribution and observe change in prediction
● Higher the change, better the method
● Perturbation may amount to replacing the feature with a random value
● Samek et al. formalize this using a metric: Area over perturbation curve
○ Plot the prediction for input with top-k features perturbed as a function of k
○ Take the area over this curve
(Figure: prediction for perturbed inputs plotted against the number of perturbed features; the drop in prediction when the top 40 features are perturbed defines the area over the perturbation curve.)
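The perturbation metric above can be sketched as follows. For simplicity, this sketch replaces each perturbed feature with a fixed baseline value rather than a random one, which is one of the options noted on the slide.

```python
import numpy as np

def area_over_perturbation_curve(model, x, attributions, baseline_value=0.0):
    """Perturb features in decreasing order of attribution, accumulate
    the drop in prediction, and average: larger area means the
    attribution method surfaced more impactful features first."""
    order = np.argsort(-np.asarray(attributions))  # most important first
    f0 = model(x)
    x_pert = np.array(x, dtype=float)
    drops = []
    for i in order:
        x_pert[i] = baseline_value  # simplification: fixed-value perturbation
        drops.append(f0 - model(x_pert))
    return float(np.mean(drops))

model = lambda x: 5 * x[0] + 1 * x[1]
x = np.array([1.0, 1.0])
good = area_over_perturbation_curve(model, x, [5.0, 1.0])  # correct ranking
bad = area_over_perturbation_curve(model, x, [1.0, 5.0])   # inverted ranking
print(good, bad)  # the correct ranking yields the larger area
```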
56. Axiomatic Justification
Inspired by how Shapley Values are justified
● List desirable criteria (axioms) for an attribution method
● Establish a uniqueness result: X is the only method that satisfies these criteria
Integrated Gradients, SHAP, QII, and Strumbelj & Kononenko are justified in this manner
Theorem [Integrated Gradients, ICML 2017]: Integrated Gradients is the
unique path-integral method satisfying: Sensitivity, Insensitivity, Linearity
preservation, Implementation invariance, Completeness, and Symmetry
58. Attributions are pretty shallow
Attributions do not explain:
● Feature interactions
● What training examples influenced the prediction
● Global properties of the model
An instance where attributions are useless:
● A model that predicts TRUE when there is an even number of black pixels and
FALSE otherwise
59. Attributions are for human consumption
● Humans interpret attributions and generate insights
○ Doctor maps attributions for x-rays to pathologies
● Visualization matters as much as the attribution technique
60. Attributions are for human consumption
(Figure: naively scaling attributions from 0 to 255 obscures the signal, because attributions have a large range and a long tail across pixels; clipping attributions at the 99th percentile reduces the range.)
● Humans interpret attributions and generate insights
○ Doctor maps attributions for x-rays to pathologies
● Visualization matters as much as the attribution technique
62. Example based Explanations
62
● Prototypes: Representative of all the training data.
● Criticisms: Data instance that is not well represented by the set of prototypes.
Figure credit: Examples are not Enough, Learn to Criticize! Criticism for Interpretability. Kim, Khanna and Koyejo. NIPS 2016
Learned prototypes and criticisms from Imagenet dataset (two types of dog breeds)
63. Influence functions
● Trace a model’s prediction through the learning algorithm and
back to its training data
● Training points “responsible” for a given prediction
63
Figure credit: Understanding Black-box Predictions via Influence Functions. Koh and Liang ICML 2017
64. Local Interpretable Model-agnostic Explanations
(Ribeiro et al. KDD 2016)
64
Figure credit: Anchors: High-Precision Model-Agnostic
Explanations. Ribeiro et al. AAAI 2018
Figure credit: Ribeiro et al. KDD 2016
68. Global Explanations Methods
● Partial Dependence Plot: Shows
the marginal effect one or two
features have on the predicted
outcome of a machine learning model
68
69. Global Explanations Methods
● Permutation feature importance: The importance of a feature is the increase in the
model's prediction error after permuting the feature's values, which breaks the
relationship between the feature and the true outcome.
69
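A minimal sketch of permutation importance as described above, using mean squared error as the prediction error:

```python
import numpy as np

def permutation_importance(model, X, y, feature, n_repeats=10, seed=0):
    """Importance = increase in mean squared error after shuffling one
    feature column, which breaks its link with the true outcome."""
    rng = np.random.default_rng(seed)
    base_err = np.mean((model(X) - y) ** 2)
    increases = []
    for _ in range(n_repeats):
        Xp = X.copy()
        rng.shuffle(Xp[:, feature])  # in-place shuffle of one column
        increases.append(np.mean((model(Xp) - y) ** 2) - base_err)
    return float(np.mean(increases))

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = 3 * X[:, 0] + 0.1 * X[:, 1]                # feature 0 matters far more
model = lambda X: 3 * X[:, 0] + 0.1 * X[:, 1]  # a model that fits y exactly
print(permutation_importance(model, X, y, 0),
      permutation_importance(model, X, y, 1))
```

Because shuffling feature 0 destroys most of the signal, its importance dwarfs that of feature 1.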
70. Achieving Explainable AI
Approach 1: Post-hoc explain a given AI model
● Individual prediction explanations in terms of input features, influential examples, concepts, or local decision rules
● Global explanations of the entire model in terms of partial dependence plots, global feature importance, or global decision rules
Approach 2: Build an interpretable model
● Logistic regression, decision trees, decision lists and sets, Generalized Additive Models (GAMs)
70
71. Decision Trees
71
(Figure: a toy decision tree for "Is the person fit?", branching on "Age < 30?", then on "Eats a lot of pizzas?" or "Exercises in the morning?", ending in Fit/Unfit leaves.)
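An interpretable model of this kind is essentially readable code: every root-to-leaf path is an explanation. Below is one plausible reading of the slide's tree as nested rules; the exact branch order in the figure is ambiguous, so the assignment of questions to branches is an assumption of this sketch.

```python
def is_fit(age, eats_lots_of_pizza, exercises_in_morning):
    """One plausible reading of the slide's 'Is the person fit?' tree.
    The branch assignment is assumed, not taken from the garbled figure."""
    if age < 30:
        # Younger branch: fitness hinges on diet
        return "Unfit" if eats_lots_of_pizza else "Fit"
    else:
        # Older branch: fitness hinges on exercise
        return "Fit" if exercises_in_morning else "Unfit"

print(is_fit(25, eats_lots_of_pizza=False, exercises_in_morning=False))  # Fit
print(is_fit(45, eats_lots_of_pizza=False, exercises_in_morning=False))  # Unfit
```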
72. Decision List
72
Figure credit: Interpretable Decision Sets: A Joint Framework for Description and Prediction. Lakkaraju, Bach, and Leskovec. KDD 2016
73. Decision Set
73
Figure credit: Interpretable Decision Sets: A Joint Framework for Description and Prediction. Lakkaraju, Bach, and Leskovec. KDD 2016
74. GLMs and GAMs
74
Intelligible Models for Classification and Regression. Lou, Caruana and Gehrke KDD 2012
Accurate Intelligible Models with Pairwise Interactions. Lou, Caruana, Gehrke and Hooker. KDD 2013
77. LinkedIn Recruiter
● Recruiter Searches for Candidates
○ Standardized and free-text search criteria
● Retrieval and Ranking
○ Filter candidates using the criteria
○ Rank candidates in multiple levels using ML
models
77
78. Modeling Approaches
● Pairwise XGBoost
● GLMix
● DNNs via TensorFlow
● Optimization Criteria: inMail Accepts
○ Positive: inMail sent by recruiter, and positively responded by candidate
■ Mutual interest between the recruiter and the candidate
78
80. How We Utilize Feature Importances for GBDT
● Understanding feature digressions
○ Which features that were impactful are no longer so?
○ Should we debug feature generation?
● Introducing new features in bulk and identifying effective ones
○ Activity features for the last 3, 6, 12, and 24 hours were introduced (costly to compute)
○ Should we keep all such features?
● Separating the factors that caused an improvement
○ Did an improvement come from a new feature, a new labeling strategy, or a new data source?
○ Did the ordering between features change?
● Shortcoming: A global view, not case by case
80
81. GLMix Models
● Generalized Linear Mixed Models
○ Global: Linear Model
○ Per-contract: Linear Model
○ Per-recruiter: Linear Model
● Lots of parameters overall
○ For a specific recruiter or contract the weights can be summed up
● Inherently explainable
○ Contribution of a feature is “weight x feature value”
○ Can be examined in a case-by-case manner as well
81
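The "weight x feature value" contribution rule above can be sketched in a few lines. The feature names and weights below are hypothetical illustrations, not LinkedIn's actual features or coefficients.

```python
def glmix_contributions(features, global_w, per_contract_w, per_recruiter_w):
    """Per-feature contribution in a GLMix-style model: the global,
    per-contract, and per-recruiter linear weights can be summed for a
    specific recruiter/contract, then contribution = weight * value."""
    contribs = {}
    for name, value in features.items():
        w = (global_w.get(name, 0.0)
             + per_contract_w.get(name, 0.0)
             + per_recruiter_w.get(name, 0.0))
        contribs[name] = w * value
    return contribs

# Hypothetical feature names and weights, for illustration only
features = {"skill_match": 0.8, "seniority_match": 0.5}
print(glmix_contributions(
    features,
    global_w={"skill_match": 1.0, "seniority_match": 0.4},
    per_contract_w={"skill_match": 0.5},
    per_recruiter_w={"seniority_match": -0.2},
))  # skill_match ≈ 1.2, seniority_match ≈ 0.1
```

Because each contribution is a single product, the breakdown can be surfaced case by case, which is what makes the model inherently explainable.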
82. TensorFlow Models in Recruiter and Explaining Them
● We utilize the Integrated Gradients [ICML 2017] method
● How do we determine the baseline example?
○ Every query creates its own feature values for the same candidate
○ Query match features, time-based features
○ Recruiter affinity, and candidate affinity features
○ A candidate would be scored differently by each query
○ Cannot recommend a “Software Engineer” to a search for a “Forensic Chemist”
○ There is no globally neutral example for comparison!
82
83. Query-Specific Baseline Selection
● For each query:
○ Score examples by the TF model
○ Rank examples
○ Choose one example as the baseline
○ Compare others to the baseline example
● How to choose the baseline example
○ Last candidate
○ Kth percentile in ranking
○ A random candidate
○ Request by user (answering a question like: “Why was I presented candidate x above
candidate y?”)
83
86. Pros & Cons
● Explains potentially very complex models
● Case-by-case analysis
○ Why do you think candidate x is a better match for my position?
○ Why do you think I am a better fit for this job?
○ Why am I being shown this ad?
○ Great for debugging real-time problems in production
● Global view is missing
○ Aggregate Contributions can be computed
○ Could be costly to compute
86
87. Lessons Learned and Next Steps
● Global explanations vs. Case-by-case Explanations
○ Global gives an overview, better for making modeling decisions
○ Case-by-case could be more useful for the non-technical user, better for debugging
● Integrated gradients worked well for us
○ Complex models make it harder for developers to map improvement to effort
○ In our use case it gave intuitive results, on top of completely describing score differences
● Next steps
○ Global explanations for Deep Models
87
94. Impact of Piecewise Approach
● Target sample x_k = (x_k1, x_k2, ⋯)
● Top feature contributor
○ LIME: large magnitude of β_kj · x_kj
○ xLIME: large magnitude of β⁻_kj · x_kj
● Top positive feature influencer
○ LIME: large magnitude of β_kj
○ xLIME: large magnitude of negative β⁻_kj or positive β⁺_kj
● Top negative feature influencer
○ LIME: large magnitude of β_kj
○ xLIME: large magnitude of positive β⁻_kj or negative β⁺_kj
94
95. Localized Stratified Sampling: Idea
Method: Sampling based on the empirical distribution around the target value at each feature level
95
96. Localized Stratified Sampling: Method
● Sampling based on the empirical distribution around the target value for each feature
● For target sample x_k = (x_k1, x_k2, ⋯), sample values of feature j according to p_j(X_j) · N(x_kj, (α · s_j)²)
○ p_j(X_j): empirical distribution
○ x_kj: feature value in the target sample
○ s_j: standard deviation
○ α: interpretable range, a tradeoff between interpretable coverage and local accuracy
● In LIME, sampling is according to N(x_j, s_j²)
96
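The sampling rule p_j(X_j) · N(x_kj, (α·s_j)²) can be sketched by reweighting the empirical feature values with a Gaussian kernel centered at the target value. This is an illustrative sketch under that reading, not the production implementation.

```python
import numpy as np

def localized_stratified_sample(feature_values, x_kj, alpha, n_samples=5, seed=0):
    """Sample a feature's values from its empirical distribution reweighted
    by a Gaussian kernel N(x_kj, (alpha * s_j)^2) centered at the target
    value x_kj, per the slide's p_j(X_j) * N(...) formulation."""
    rng = np.random.default_rng(seed)
    vals = np.asarray(feature_values, dtype=float)
    s_j = vals.std()
    # Gaussian kernel weight for each empirical value
    w = np.exp(-0.5 * ((vals - x_kj) / (alpha * s_j)) ** 2)
    return rng.choice(vals, size=n_samples, p=w / w.sum())

vals = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
# A small alpha keeps samples near the target value 2.0, ignoring the
# distant cluster at 10-12 -- unlike LIME's untruncated N(x_j, s_j^2)
print(localized_stratified_sample(vals, x_kj=2.0, alpha=0.1))
```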
98. LTS LCP (LinkedIn Career Page) Upsell
● A subset of churn data
○ Total Companies: ~ 19K
○ Company features: 117
● Problem: Estimate whether there will be upsell given a set of features about
the company’s utility from the product
98
102. Key Takeaways
● Looking at the explanation as contributor vs. influencer features is useful
○ Contributor: Which features end up in the current outcome, case by case
○ Influencer: What needs to be done to improve the likelihood, case by case
● xLIME aims to improve on LIME via:
○ Piecewise linear regression: More accurately describes local point, helps with finding correct
influencers
○ Localized stratified sampling: More realistic set of local points
● Better captures the important features
102
118. Teams
● Search
● Feed
● Comments
● People you may know
● Jobs you may be interested in
● Notification
118
119. Case Study:
Integrated Gradients for Adversarial Analysis of
Question-Answering models
Ankur Taly** (Fiddler labs)
(Joint work with Mukund Sundararajan, Kedar
Dhamdhere, Pramod Mudrakarta)
119
**This research was carried out at Google Research
120. Robustness question: Do these models understand the question? :-)
● Tabular QA: Neural Programmer (2017) model; 33.5% accuracy on WikiTableQuestions
○ Q: How many medals did India win? A: 197
● Visual QA: Kazemi and Elqursh (2017) model; 61.1% on VQA 1.0 dataset (state of the art = 66.7%)
○ Q: How symmetrical are the white bricks on either side of the building? A: very
● Reading Comprehension: Yu et al (2018) model; 84.6 F-1 score on SQuAD (state of the art)
○ Passage: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver's Executive Vice President of Football Operations and General Manager
○ Q: Name of the quarterback who was 38 in Super Bowl XXXIII? A: John Elway
120
121. Kazemi and Elqursh (2017) model.
Accuracy: 61.1% (state of the art: 66.7%)
Visual QA
Q: How symmetrical are the white bricks on either side of the building?
A: very
121
122. Visual QA
Q: How symmetrical are the white bricks on either side of the building?
A: very
Q: How asymmetrical are the white bricks on either side of the building?
A: very
122
Kazemi and Elqursh (2017) model.
Accuracy: 61.1% (state of the art: 66.7%)
123. Visual QA
Q: How symmetrical are the white bricks on either side of the building?
A: very
Q: How asymmetrical are the white bricks on either side of the building?
A: very
Q: How big are the white bricks on either side of the building?
A: very
123
Kazemi and Elqursh (2017) model.
Accuracy: 61.1% (state of the art: 66.7%)
124. Visual QA
Q: How symmetrical are the white bricks on either side of the building?
A: very
Q: How asymmetrical are the white bricks on either side of the building?
A: very
Q: How big are the white bricks on either side of the building?
A: very
Q: How fast are the bricks speaking on either side of the building?
A: very
124
Kazemi and Elqursh (2017) model.
Accuracy: 61.1% (state of the art: 66.7%)
125. Visual QA
Q: How symmetrical are the white bricks on either side of the building?
A: very
Q: How asymmetrical are the white bricks on either side of the building?
A: very
Q: How big are the white bricks on either side of the building?
A: very
Q: How fast are the bricks speaking on either side of the building?
A: very
125
Test/dev accuracy does not show us the
entire picture. Need to look inside!
Kazemi and Elqursh (2017) model.
Accuracy: 61.1% (state of the art: 66.7%)
126. Analysis procedure
● Attribute the answer (or answer selection logic) to question words
○ Baseline: Empty question, but full context (image, text, paragraph)
■ By design, attribution will not fall on the context
● Visualize attributions per example
● Aggregate attributions across examples
127. Visual QA attributions
Q: How symmetrical are the white bricks on either side of the building?
A: very
How symmetrical are the white bricks on
either side of the building?
red: high attribution
blue: negative attribution
gray: near-zero attribution
127
128. Over-stability [Jia and Liang, EMNLP 2017]
Jia & Liang note that:
● Image networks suffer from “over-sensitivity” to pixel perturbations
● Paragraph QA models suffer from “over-stability” to semantics-altering edits
Attributions show how such over-stability manifests in Visual QA, Tabular QA and
Paragraph QA networks
129. Visual QA Over-stability
During inference, drop all words from the dataset except ones which are frequently top attributions
● E.g. How many red buses are in the picture?
Top tokens: color, many, what, is, how, there, …
129
130. Visual QA Over-stability
During inference, drop all words from the dataset except ones which are frequently top attributions
● E.g. How many red buses are in the picture?
Top tokens: color, many, what, is, how, there, …
● 80% of final accuracy is reached with just 100 words (original vocab size: 5305)
● 50% of final accuracy is reached with just one word, "color"
130
131. Attack: Subject ablation
Replace the subject of a question with a low-attribution noun from the vocabulary
● This ought to change the answer but often does not!
Low-attribution nouns: 'tweet', 'childhood', 'copyrights', 'mornings', 'disorder', 'importance', 'topless', 'critter', 'jumper', 'fits'
What is the man doing? → What is the tweet doing?
How many children are there? → How many tweet are there?
The VQA model's response remains the same 75.6% of the time on questions that it originally answered correctly
131
132. Many other attacks!
● Visual QA
○ Prefix concatenation attack (accuracy drop: 61.1% to 19%)
○ Stop word deletion attack (accuracy drop: 61.1% to 52%)
● Tabular QA
○ Prefix concatenation attack (accuracy drop: 33.5% to 11.4%)
○ Stop word deletion attack (accuracy drop: 33.5% to 28.5%)
○ Table row reordering attack (accuracy drop: 33.5% to 23%)
● Paragraph QA
○ Improved paragraph concatenation attacks of Jia and Liang from [EMNLP 2017]
Paper: Did the model understand the question? [ACL 2018]
134. Fiddler is an explainable AI engine designed for the enterprise
● Pluggable Platform: Integrate, deploy, and visualize a wide variety of custom models
● Explainable AI: Deliver clear decisions and explanations to your end users
● Trust & Governance: Easy governed access helps teams build and understand Trusted AI
● Simplified Setup: Lean and pluggable AI platform with cloud or on-prem integrations
135. Fiddler - Explainable AI Engine
All your data (any data warehouse) and custom models plug into the Fiddler Modeling Layer, which delivers explainable AI for everyone through APIs, dashboards, reports, and trusted insights
135
136. Benefits Across the Organization
● Customers: More trust and customer retention
● DevOps: DevOps and IT have greater visibility into AI
● Business Owner: Transparent AI workflow and GDPR compliance
● Data Scientists: Data scientists are more agile and effective
138. Can explanations help build Trust?
● Can we know when the model is
uncertain?
● Does the model make the same
mistake as a human?
● Are we comfortable with the model?
*Zachary Lipton, et al. The Mythos of Model Interpretability ICML 2016
139. Can explanations help identify Causality?
● Predictions vs actions
● Explanations on why this happened as
opposed to how
*Zachary Lipton, et al. The Mythos of Model Interpretability ICML 2016
140. Can explanations be Transferable?
● Training and test setups often differ
from the wild
● Real world data is always changing
and noisy
*Zachary C. Lipton. The Mythos of Model Interpretability. ICML Workshop on Human Interpretability (WHI) 2016
141. Can explanations provide more Information?
● Oftentimes, models aid human decisions
● Extra bits of information beyond the model's decision could be valuable
*Zachary C. Lipton. The Mythos of Model Interpretability. ICML Workshop on Human Interpretability (WHI) 2016
142. Challenges & Tradeoffs
[Diagram: trade-offs between User Privacy, Transparency, Fairness, and Performance]
● Lack of standard interface for ML models
makes pluggable explanations hard
● Explanation needs vary depending on the type
of the user who needs it and also the problem
at hand.
● The algorithm you employ for explanations
might depend on the use-case, model type,
data format, etc.
● There are trade-offs w.r.t. Explainability,
Performance, Fairness, and Privacy.
143. Reflections
● Case studies on explainable AI in
practice
● Need “Explainability by Design” when
building AI products
144. Fairness, Privacy, Transparency, Explainability
Related KDD’19 sessions:
1. Tutorial: Fairness-Aware Machine Learning: Practical Challenges and Lessons Learned (Sun)
2. Workshop: Explainable AI/ML (XAI) for Accountability, Fairness, and Transparency (Mon)
3. Social Impact Workshop (Wed, 8:15 – 11:45)
4. Keynote: Cynthia Rudin, Do Simpler Models Exist and How Can We Find Them? (Thu, 8 - 9am)
5. Several papers on fairness (e.g., ADS7 (Thu, 10-12), ADS9 (Thu, 1:30-3:30))
6. Research Track Session RT17: Interpretability (Thu, 10am - 12pm)
145. Thanks! Questions?
● Feedback most welcome :-)
○ krishna@fiddler.ai, sgeyik@linkedin.com,
kkenthapadi@linkedin.com, vamithal@linkedin.com,
ankur@fiddler.ai
● Tutorial website: https://sites.google.com/view/kdd19-explainable-ai-tutorial
● To try Fiddler, please send an email to info@fiddler.ai
Editor's Notes
Krishna to talk about the evolution of AI
Models have many interactions within
a business, multiple stakeholders, and varying requirements. Businesses need a
360-degree view of all models within a unified modeling universe that allows them to
understand how models affect one another and all aspects of business functions across
the organization.
Worth saying that some of the highlighted concerns are applicable even for decisions not involving humans.
Example: “Can I trust our AI decisions?” arises in domains with high stakes, e.g., deciding where to excavate/drill for mining/oil exploration;
understanding the predicted impact of certain actions (e.g., reducing carbon emission by xx%) on global warming.
Krishna
At the same time, explainability is important from compliance & legal perspective as well.
There have been several laws in countries such as the United States that prohibit discrimination based on “protected attributes” such as race, gender, age, disability status, and religion. Many of these laws have their origins in the Civil Rights Movement in the United States (e.g., US Civil Rights Act of 1964).
When legal frameworks prohibit the use of such protected attributes in decision making, there are usually two competing approaches on how this is enforced in practice: Disparate Treatment vs. Disparate Impact. Avoiding disparate treatment requires that such attributes should not be actively used as a criterion for decision making and no group of people should be discriminated against because of their membership in some protected group. Avoiding disparate impact requires that the end result of any decision making should result in equal opportunities for members of all protected groups irrespective of how the decision is made.
Please see NeurIPS’17 tutorial titled Fairness in machine learning by Solon Barocas and Moritz Hardt for a thorough discussion.
Renewed focus in light of privacy / algorithmic bias / discrimination issues observed recently with algorithmic systems.
Examples: EU General Data Protection Regulation (GDPR), which came into effect in May, 2018, and subsequent regulations such as California’s Consumer Privacy Act, which would take effect in January, 2020.
The focus is not only around privacy of users (as the name may indicate), but also on related dimensions such as algorithmic bias (or ensuring fairness), transparency, and explainability of decisions.
Image credit: https://pixabay.com/en/gdpr-symbol-privacy-icon-security-3499380/
European Commission Vice-President for the #DigitalSingleMarket.
Retain human decision in order to assign responsibility.
https://gdpr-info.eu/art-22-gdpr/
Bryce Goodman, Seth Flaxman, European Union regulations on algorithmic decision-making and a “right to explanation”
https://arxiv.org/pdf/1606.08813.pdf
https://gdpr-info.eu/recitals/no-71/
Convey that this is the Need
Purpose: Convey the need for adopting “explainability by design” approach when building AI products, that is, we should take explainability into account while designing & building AI/ML systems, as opposed to viewing these factors as an after-thought.
This criterion distinguishes whether interpretability is achieved by restricting the complexity of the machine learning model (intrinsic) or by applying methods that analyze the model after training (post hoc). Intrinsic interpretability refers to machine learning models that are considered interpretable due to their simple structure, such as short decision trees or sparse linear models. Post hoc interpretability refers to the application of interpretation methods after model training.
We ask this question for a trained image classification network.
There are several different formulations of this Why question.
Explain the function captured by the network in the local vicinity of the input
Explain the entire function of the network
Explain each prediction in terms of training examples that influence it
And then there is attribution which is what we consider.
Some important methods not covered
Grad-CAM [ICCV 2017]
Grad-CAM++ [arxiv 2017]
SmoothGrad [arxiv 2017]
Point out correlation is not causation (yet again)
In such cases, it may be feasible to attribute to a small set of derived features, e.g., super pixels
Attributions are useful when the network behavior entails that a strict subset of input features are important
So why do we need relevance debugging and explaining? The most obvious reason is that we want to improve the performance of our relevance machine learning models. After days and weeks of data collection, feature engineering, and model training, the model is finally deployed, and we want to know how our machine learning models actually work in production. Do they work well with the other pieces of the system? Do they work well on edge cases? How can we improve them? If we only look at the metrics, we won't be able to answer these questions.
By Improving the relevance machine learning models, we provide value to our members. Our members come to LinkedIn to look for people and jobs they are interested in, and to stay connected to their professional network. We want to provide the most relevant experience to our members.
In this way, we can build trust with our members. Our members will only trust us if we understand what we are doing. For example, if we recommend an entry-level job to a CEO, we had better understand what is going on and why we are making that mistake.
Now, let’s take a look at the architecture of our relevance system at LinkedIn. Let’s take search as an example.
Our members use their mobile apps or browsers to find relevant information; the request goes through the frontend API layer and is sent to the federator layer. There are different types of entities the member may be looking for: for example, they could be looking for another member, or for a job. So the federator sends the query to the broker layer for each type of entity.
For each of these entity types, the data is partitioned into different shards, and each shard has a cluster of machines located in different data centers. So the request goes to the different shards, and the search results are fetched and finally returned to the member.
This is where machine learning models come into play. The federator layer tries to understand what the user is looking for in order to decide which brokers to query; it could be one or more, depending on the search keyword. So there is a machine learning model in this layer to figure out the user's intention. In the broker and shard layers, there are machine learning models to score the search results, with a first-pass ranking and a second-pass ranking.
So what could go wrong in this diagram? First of all, the services could go down. The request could fail because of a bug, or the query could take too long and time out.
The data in each shard could be stale, or the data might not exist in our search index.
When it comes to the machine learning part:
1. The quality of a feature might not be good; sometimes the feature doesn't really represent what we thought it should.
2. Features could be inconsistent between offline training and online serving, due to issues with the feature generation pipeline.
3. When we ramp a new model, it doesn't always work well for all use cases, so that could also go wrong.
Debugging these kinds of issues can be hard.
First of all, the infrastructure is complex. As we have seen in the architecture, there are many services involved in this system; they are deployed on different hosts in different data centers, and most of the time they are owned by different teams. So for an AI engineer, it can be hard to understand all the components in the system.
The issue might be hard to reproduce locally. The production environment can be very different from the dev environment, and it is constantly changing.
Some issues could be time related, and these are especially hard to debug. For example, when a member reports that they saw a feed item or received a notification that is not relevant to them, it could be days later when an AI engineer picks up the issue, and they would have no idea what was going on at that time. So basically they have no way to debug the issue.
Also, debugging issues locally can be time consuming: you need to set up a local environment, deploy multiple services locally, and get permission to access production data to reproduce the issue, and most of the time you won't be able to reproduce it. The whole process can take a long time, and it's really not productive.
So what we do is to provide the infrastructure to collect the debugging data and a web UI tool to query the system and visualize the data.
This is the overall architecture of the solution that we provide.
The Engineers will use the Web UI tool to query the relevance system.
The request sent out from the tool identifies itself as a debugging call, so when each service receives the request, it collects data for debugging: for example, the request and response of the service, some statistics about the query, and, most importantly, the features and scores of each item.
And each of these services will send out the collected data using Kafka and the data will be stored in some kind of storage system. After the query has finished, the web UI tool will query the storage system and visualize the call graph and data to the engineer.
The reason we use Kafka instead of returning the debug data in the response is that the data size could be large, which would affect the performance of the system. Also, if a service fails, we would lose all the debugging data from the other services.
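As a sketch of the pipeline described here, a service might package its per-request debugging data and emit it asynchronously. The event schema, field names, and topic name below are illustrative assumptions (not LinkedIn's actual schema), and the Kafka send itself is shown only in comments.

```python
import json
import time

def make_debug_event(service, request_id, features, scores):
    """Package per-request debugging data for asynchronous emission.
    Field names are illustrative, not the real production schema."""
    return {
        "service": service,
        "request_id": request_id,
        "timestamp_ms": int(time.time() * 1000),
        "features": features,   # feature name -> value, per item
        "scores": scores,       # item id -> model score
    }

event = make_debug_event("search-broker", "req-123",
                         {"title_match": 0.8}, {"job:42": 1.7})
payload = json.dumps(event).encode("utf-8")

# In production, this payload would go out on a Kafka topic instead of
# the response path, e.g. (illustrative, using kafka-python):
#   producer = KafkaProducer(bootstrap_servers="...")
#   producer.send("relevance-debug-events", payload)
```

Keeping the debug payload off the response path preserves latency, and a crash in one service does not take down the debug data already emitted by the others.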
Next, I’ll show you some UI parts of our debugging tool and how we can use them to debug the issues.
The call graph here shows how the query was sent to different services. The service call graph could be different depending on the search keyword and search vertical. So it's very important for our engineers to understand the call graph, and to be able to see the request and response and what happened in each of these services. For example, if a service returned a 500, we would like to see the hostname, the version of the service, and also the logs.
Sometimes the engineers not only want to see what was returned by the frontend API; they also want to debug a particular service in a particular data center or cluster. They can use our tool to specify which service or host they want to debug and construct the query. In this way, they can tweak the request to see how it changes the results.
Our users can also see some statistics about the query, such as how long each phase takes and how efficient the query is. This can help them locate issues or improve the latency of the system.
Sometimes the engineers get questions like "Why am I seeing this feed item?" or "Why is this search result ranked in first place?" To answer these questions, they would like to look at the features of these items. The issue usually happens when they are experimenting with a new feature in the model, and the quality of the feature doesn't meet their expectations or doesn't perform well in the online system. In this case, the engineers are interested in this particular new feature, to see if its value is abnormal.
Sometimes the issue happens when a new model is being ramped in production. Since we could have multiple models deployed for A/B tests, the engineers would also like to know which model was used at the time of scoring.
This is also useful because sometimes a model has a failure and it falls back to a default model, they would also want to confirm which one was used.
Our engineers also get questions such as "Why am I not seeing something?" It is very helpful to see what has been returned and what has been dropped in each layer, and to see how an item was re-ranked. Sometimes the data doesn't exist in our online store; sometimes items are dropped because the score is not high enough. This helps the engineers find the root cause quickly.
Here are some more advanced and complex use cases when it comes to debugging and explaining relevance systems. We'll now go over each one.
The first use case is system perturbation. The idea is simple: I want to change something in the system and see how the results change. For example, I want to confirm a bug fix before making code changes, or just to observe how a new model would work before exposing it to traffic. So how does it work? First, the engineer uses our tool to specify what kind of perturbation he or she wants to perform, which is injected as part of the request. Our tool then passes the perturbation to downstream services, overrides the system behavior, and surfaces the results back to the engineer.
Some examples of perturbation include:
- Overriding settings or treatments of online A/B tests
- Selecting a new machine learning model and sanity-checking its results before ramping it up on traffic
- Overriding feature values to see if an item would be ranked differently in the search results
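A minimal sketch of how such a perturbation might be injected into a request before it is passed downstream; the request fields and override keys are hypothetical, not the real schema.

```python
def apply_perturbation(request, overrides):
    """Return a copy of the request with debug-only overrides injected,
    e.g. forcing an A/B treatment, a candidate model, or a feature value.
    The keys used here are illustrative, not a real request schema."""
    perturbed = dict(request)          # leave the original request untouched
    perturbed.update(overrides)
    perturbed["is_debug"] = True       # downstream services can skip metrics/logging
    return perturbed

req = {"query": "software engineer", "model": "ranker-v1"}
perturbed = apply_perturbation(req, {
    "model": "ranker-v2-candidate",           # sanity-check a new model
    "feature_overrides": {"seniority": 9},    # force a feature value
})
```

Marking the request as a debug call lets downstream services apply the overrides and collect debug data without polluting production metrics.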
Another advanced use case is side-by-side comparison:
- We may want to compare the results of 2 different machine learning models
- Or we may want to compare the features and ranking scores of 2 items. The 2 items can come from a single query (to understand their relative ranking) or from 2 different queries (to understand differences between models).
Here’s the user interface when comparing results from 2 different queries. The UI shows changes in ranking positions of the same item. As you can see in this example, Query 1 on the left side, using a different model than Query 2 on the right side, ranked the Lead Software Engineer job higher.
And why do the models rank the same item differently? Our tool can compare this item from 2 queries and show the difference in the ranking process of the 2 models. As you can see in the table, this UI highlights the differences in feature values, both raw and computed by the models. AI engineers use this comparison information to understand model behavior and identify potential problems or improvements.
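The feature comparison shown in this UI can be sketched as a simple diff over two feature maps; the feature names below are made up for illustration.

```python
def diff_features(features_a, features_b, tol=1e-9):
    """Side-by-side comparison: return the features whose values differ
    between two scored items (or the same item under two models).
    A value of None means the feature was absent on that side."""
    diffs = {}
    for name in sorted(set(features_a) | set(features_b)):
        a, b = features_a.get(name), features_b.get(name)
        if a is None or b is None or abs(a - b) > tol:
            diffs[name] = (a, b)
    return diffs

# Hypothetical feature values for one job item under two models/queries.
diffs = diff_features({"title_match": 0.8, "seniority": 5.0},
                      {"title_match": 0.8, "seniority": 7.0, "freshness": 0.3})
```

Surfacing only the differing features (and hiding the identical ones) is what lets an engineer spot the value driving the ranking change at a glance.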
The last use case is Replay. Scoring and ranking relevant information could be time sensitive. For example, we may get a complaint about old items showing up in a member’s feed. To debug the problem, we need to reproduce the feed items returned by the model at a specific timepoint in the past. Our tool provides the capability of replaying past model inferences.
How does it work? First, when making model inferences, the online system emits features and ranking scores using Kafka. These Kafka events are then consumed by our tool using Samza and stored in a database. Later, AI engineers can query the database for historical data when debugging.
This is the UI for Feed Replay. On the left side, an AI engineer can query sessions of a given member within a time range. When a session is selected, the right side shows the list of feed items shown to the member during that session.
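The replay lookup can be sketched as a time-range query over the stored inference records; the record schema below is an illustrative assumption, and a plain list stands in for the database populated from the Kafka/Samza pipeline.

```python
def query_sessions(records, member_id, start_ms, end_ms):
    """Replay lookup: return a member's stored inference records within
    a time range, newest first. `records` stands in for the database
    populated from the Kafka/Samza pipeline; the schema is hypothetical."""
    hits = [r for r in records
            if r["member_id"] == member_id
            and start_ms <= r["timestamp_ms"] <= end_ms]
    return sorted(hits, key=lambda r: r["timestamp_ms"], reverse=True)

records = [
    {"member_id": 7, "timestamp_ms": 1000, "items": ["feed:a"]},
    {"member_id": 7, "timestamp_ms": 2000, "items": ["feed:b"]},
    {"member_id": 8, "timestamp_ms": 1500, "items": ["feed:c"]},
]
session = query_sessions(records, member_id=7, start_ms=0, end_ms=2500)
```

Because the records were captured at inference time, the engineer sees exactly what the model scored days ago, which is what makes time-sensitive complaints debuggable at all.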
So we have talked about many different use cases, and I would like add a note on security and privacy. As you have seen earlier, debugging and explaining the relevance systems require access to personalized data. And we have always made privacy concerns a central part of the tool. Like the rest of LinkedIn, our debugging tool is GDPR compliant. We control and audit all access. Debugging data has a limited retention time of about a couple of weeks. And the replay capability from the previous slide only works for our internal employees. With the debugging tool, we give our engineers as much visibility as possible while protecting our members’ privacy.
Here is a list of AI teams that use our tool on a daily basis: search, feed, comments, etc. As AI becomes more and more prevalent in our daily lives, it's important to understand models and debug them when they don't behave as expected. We hope our talk has been helpful, and we're happy to answer questions. Thank you!
Have these models read the question carefully? Can we say “yes”, because they achieve good accuracy? That is the question we wish to address in this paper. Let us start by looking at an example.
Completely changing the question's meaning still does not change the behaviour of the model. Clearly, the model has not fully grasped the question.
You may feel that Integration and Gradients are inverse operations of each other, and we did nothing.
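The note above refers to Integrated Gradients, which integrates the model's gradients along a straight-line path from a baseline to the input. Here is a minimal numeric sketch, assuming we are handed a gradient oracle; the toy model is linear (constant gradient), so the result is exact and satisfies the completeness property (attributions sum to f(x) − f(baseline)).

```python
def integrated_gradients(grad_f, x, baseline, steps=50):
    """Riemann approximation of integrated gradients along the path
    from `baseline` to `x`. `grad_f(point)` returns the gradient of the
    model output at `point` (a stand-in for autodiff)."""
    n = len(x)
    attrs = [0.0] * n
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(n):
            attrs[i] += g[i] * (x[i] - baseline[i]) / steps
    return attrs

# Toy linear model f(x) = 3*x0 + 2*x1, whose gradient is constant [3, 2].
attrs = integrated_gradients(lambda p: [3.0, 2.0], [1.0, 1.0], [0.0, 0.0])
# Completeness: the attributions sum to f(x) - f(baseline) = 5 - 0.
```

Despite the integral and the gradient looking like inverse operations, the method does something nontrivial: it distributes the output difference f(x) − f(baseline) across the input features.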
We encourage all of you to check out KDD’19 sessions/tutorials/talks related to these topics:
Tutorial: Fairness-Aware Machine Learning: Practical Challenges and Lessons Learned (Sun)
Workshop: Explainable AI/ML (XAI) for Accountability, Fairness, and Transparency (Mon)
Social Impact Workshop (Wed, 8:15 – 11:45)
Keynote: Cynthia Rudin, Do Simpler Models Exist and How Can We Find Them? (Thu, 8 - 9am)
Several papers on fairness (e.g., ADS7 (Thu, 10-12), ADS9 (Thu, 1:30-3:30))
Research Track Session RT17: Interpretability (Thu, 10am - 12pm)