Houston Machine Learning
All About Bandits!
4/13/2019
Neilkunal Panchal
Outline
• k-armed Bandits
• Action Value Methods
• Tracking a non-stationary problem
• Optimistic initial values
• Upper confidence bound action selections
• Gradient Bandit Algorithms
• Contextual Bandits
• Thompson Sampling
Introduction
• Chapter 2 of Reinforcement Learning:
An Introduction (Sutton and Barto, 2017)
• http://incompleteideas.net/book/bookdraft2017nov5.pdf
k-armed Bandit
[Figure: a row of k slot machines, one lever per action: At = 1, At = 2, At = 3, …, At = k]
Goal: Maximise expected total reward over 1000 actions or time steps
Action at step t: At
Reward at step t: Rt
The value of an arbitrary action a is the expected reward given that a is selected:
q*(a) = E[Rt | At = a]
Note: if we knew q*(a), selecting the best a would be trivial. Let Qt(a) be the estimated value function. We would like Qt(a) ≈ q*(a).
k-armed Bandits – Exploration and Exploitation
• Given Qt(a), the estimated value function,
• the greedy action is a* = argmax_a Qt(a)
• When selecting the greedy action, we are exploiting
• When selecting a non-greedy action, we are exploring
• Exploring allows us to obtain a better estimate of Qt(a)
• Exploring allows us to obtain better values of a in the long run
• Exploiting is the right thing to do in the short run
Exploration–Exploitation Conflict: with a single action it is not possible to simultaneously explore and exploit. One must balance short-term and long-term rewards.
Action Value Methods: epsilon-Greedy
• Look at methods to better estimate Qt(a)
• Estimate Q by the rewards actually received (the sample average):
Qt(a) = Σi Ri · 1(Ai = a) / Σi 1(Ai = a), summing over i = 1 … t − 1,
where 1(Ai = a) = 1 if Ai = a and 0 otherwise
• Qt(a) tends to q*(a) as the denominator goes to infinity
• Greedy action selection is given by At = argmax_a Qt(a)
(argmax selects the argument which maximises Q)
• Epsilon-greedy action selection: with probability 1 − ε take the greedy action; with probability ε pick an action uniformly at random
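A minimal sketch of the epsilon-greedy rule (NumPy assumed; the function name epsilon_greedy is illustrative, not from the slides):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """With probability epsilon pick a random arm, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore
    return int(np.argmax(Q))               # exploit (ties go to the first index)

rng = np.random.default_rng(0)
Q = np.zeros(10)                           # estimates for a 10-armed bandit
a = epsilon_greedy(Q, epsilon=0.1, rng=rng)
```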
10-arm test bed of randomly generated bandit problems
• True values q*(a) drawn from a Gaussian distribution, ~ N(0, 1)
• Reward Rt drawn from a normal distribution centred on the chosen arm's true value, ~ N(q*(At), 1)
• 1 run is 1000 time steps, and results are averaged over 2000 runs
[Figure: results on the 10-arm test bed for randomly generated bandit problems]
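One way to simulate the test bed just described, pairing epsilon-greedy with sample-average estimates; run_testbed and the parameter defaults are assumptions for illustration:

```python
import numpy as np

def run_testbed(k=10, steps=1000, runs=2000, epsilon=0.1, seed=0):
    """Average reward per time step for epsilon-greedy on the test bed."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        q_star = rng.normal(0.0, 1.0, k)      # true values ~ N(0, 1)
        Q = np.zeros(k)                       # sample-average estimates
        N = np.zeros(k)                       # pull counts
        for t in range(steps):
            if rng.random() < epsilon:
                a = int(rng.integers(k))
            else:
                a = int(np.argmax(Q))
            r = rng.normal(q_star[a], 1.0)    # reward ~ N(q*(a), 1)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]         # incremental average (next slide)
            avg_reward[t] += r
    return avg_reward / runs
```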
Incremental Approach
• How can the sample-average estimate Q be computed in a more computationally efficient way?
• Naive form: Qn = (R1 + R2 + … + Rn−1) / (n − 1), where Qn is the estimate of the action value after n − 1 trials and Ri is the ith reward
• The memory requirement of the naive form grows with n
• The incremental form needs only the previous estimate: Qn+1 = Qn + (1/n) (Rn − Qn)
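The incremental form as a sketch (update_mean is a hypothetical helper):

```python
def update_mean(Q, R, n):
    """Incremental sample average: Q_{n+1} = Q_n + (R_n - Q_n) / n."""
    return Q + (R - Q) / n

Q = 0.0
for n, R in enumerate([1.0, 0.0, 2.0], start=1):
    Q = update_mean(Q, R, n)   # constant memory, no stored reward history
# Q == 1.0, the mean of the three rewards
```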
Tracking a non-stationary problem
• The problems we have encountered so far assume a stationary bandit. (Reward probabilities do
not change)
• How do we deal with non-stationary bandits?
• Here we can weight recent rewards more heavily by using a constant step size α ∈ (0, 1]:
Qn+1 = Qn + α (Rn − Qn)
• Expanding the recursion shows this is an exponential recency-weighted average:
Qn+1 = (1 − α)^n Q1 + Σi α (1 − α)^(n−i) Ri
• General conditions for convergence of the step sizes αn:
Σn αn = ∞ (steps are large enough to overcome initial conditions or random variation)
Σn αn² < ∞ (steps eventually become small enough to converge)
• Both conditions are met for αn = 1/n but not for a constant step size; failing to converge is exactly what we want when tracking a non-stationary environment
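A sketch of the constant-step-size update tracking a drifting arm; the drift rate and alpha = 0.1 are illustrative assumptions:

```python
import numpy as np

def update_ewma(Q, R, alpha=0.1):
    """Constant step size: Q <- Q + alpha * (R - Q), a recency-weighted average."""
    return Q + alpha * (R - Q)

rng = np.random.default_rng(1)
Q, q_star = 0.0, 0.0
for _ in range(1000):
    q_star += 0.01                         # the arm's true value slowly drifts
    Q = update_ewma(Q, rng.normal(q_star, 1.0))
# Q tracks the drifting q_star, where a 1/n sample average would lag behind.
```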
Optimistic initial values
• A technique for encouraging initial exploration is called optimistic initial values
• Consider the test bed, but with the initial action-value estimates set optimistically to Q1(a) = 5 for all arms
• This produces a fast burst of initial exploration and improved long-run performance over epsilon-greedy
• Challenge: only useful for stationary problems, requires a hand-selected hyperparameter, and becomes less relevant for larger t
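A sketch of optimistic initialisation on the test bed with purely greedy selection (all names and constants illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 10
q_star = rng.normal(0.0, 1.0, k)
Q = np.full(k, 5.0)            # optimistic: every arm looks better than it is
N = np.zeros(k)
for _ in range(100):
    a = int(np.argmax(Q))      # purely greedy, epsilon = 0
    r = rng.normal(q_star[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]  # realistic rewards drag Q[a] down, so untried
                               # (still optimistic) arms get picked next
```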
Upper confidence bound action selections
Epsilon-greedy is a good method for exploration, but it doesn't take into account the uncertainty of the estimates, nor does it prefer exploring near-optimal actions.
An improvement is to select actions according to an upper confidence bound:
At = argmax_a [ Qt(a) + c √( ln t / Nt(a) ) ]
where Nt(a) is the number of times action a has been selected, so the square-root term is a measure of uncertainty in the estimate, and c > 0 controls the degree of exploration.
• Improved performance over epsilon-greedy on the test bed
• Works well for bandits and underlies MCTS, but isn't generally used in the wider RL setting
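A sketch of the UCB rule above (ucb_action is an assumed name; c = 2 matches the value the book uses on the test bed):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]."""
    if np.any(N == 0):
        return int(np.argmin(N))   # try every arm once before using the bound
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

Q, N = np.zeros(10), np.zeros(10)
a = ucb_action(Q, N, t=1)          # first calls cycle through the untried arms
```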
Gradient Bandit Algorithms
So far we have considered methods that estimate action values Q(a).
In gradient bandits, a numerical preference Ht(a) is instead learned for each action.
Given Ht(a), the probability of taking action a at time t is given by the softmax:
πt(a) = e^Ht(a) / Σb e^Ht(b)
Initially all preferences are the same (H1(a) = 0), so all actions are equiprobable.
A natural learning algorithm (based on stochastic gradient ascent), with learning rate α and the average reward R̄t as a baseline:
Ht+1(At) = Ht(At) + α (Rt − R̄t) (1 − πt(At))
Ht+1(a) = Ht(a) − α (Rt − R̄t) πt(a)   for all a ≠ At
• The baseline improves performance
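The preference update as a sketch, sampling the action from the softmax and maintaining the baseline incrementally (helper names are assumptions):

```python
import numpy as np

def softmax(H):
    e = np.exp(H - H.max())                # subtract max for numerical stability
    return e / e.sum()

def gradient_bandit_step(H, R, baseline, action, alpha=0.1):
    """Preference update; both cases of the rule in one vectorised form."""
    pi = softmax(H)
    H = H - alpha * (R - baseline) * pi    # H_{t+1}(a) for a != A_t
    H[action] += alpha * (R - baseline)    # restores the (1 - pi) factor for A_t
    return H

rng = np.random.default_rng(3)
q_star = rng.normal(0.0, 1.0, 10)
H, Rbar = np.zeros(10), 0.0
for t in range(1, 1001):
    a = int(rng.choice(10, p=softmax(H)))  # sample the action from the policy
    r = rng.normal(q_star[a], 1.0)
    H = gradient_bandit_step(H, r, Rbar, a)
    Rbar += (r - Rbar) / t                 # incremental average-reward baseline
```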
Contextual Bandits
[Figure: several k-armed bandits, one per context; a policy maps each observed context to an action, and the rewards Rt depend on both the context and the chosen arm]
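The idea in the figure, as a sketch: keep one value table per observed context and run an ordinary epsilon-greedy bandit inside each (the class and its layout are assumptions, not from the slides):

```python
import numpy as np

class ContextualEpsilonGreedy:
    """One epsilon-greedy value table per observed context."""
    def __init__(self, k, epsilon=0.1, seed=0):
        self.k, self.epsilon = k, epsilon
        self.rng = np.random.default_rng(seed)
        self.Q, self.N = {}, {}            # context -> per-arm estimates/counts

    def act(self, context):
        Q = self.Q.setdefault(context, np.zeros(self.k))
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.k))
        return int(np.argmax(Q))

    def update(self, context, action, reward):
        Q = self.Q.setdefault(context, np.zeros(self.k))
        N = self.N.setdefault(context, np.zeros(self.k))
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]
```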
Thompson Sampling
• https://towardsdatascience.com/solving-multiarmed-bandits-a-comparison-of-epsilon-greedy-and-thompson-sampling-d97167ca9a50
The Thompson sampler differs quite fundamentally from the epsilon-greedy algorithm in three major ways:
• it is not greedy;
• its exploration is more sophisticated; and
• it is Bayesian
• Set a uniform prior distribution between 0 and 1 for each variant k's payout rate
• Draw a parameter theta from each variant k's posterior distribution
• Select the variant k with the highest sampled theta
• Observe the outcome and update the chosen variant's posterior distribution parameters
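A minimal Beta-Bernoulli sketch of the steps above (thompson_bernoulli and the payout rates are illustrative):

```python
import numpy as np

def thompson_bernoulli(payouts, steps=1000, seed=0):
    """Beta-Bernoulli Thompson sampling over variants with fixed payout rates."""
    rng = np.random.default_rng(seed)
    k = len(payouts)
    alpha, beta = np.ones(k), np.ones(k)    # Beta(1, 1) = uniform prior on [0, 1]
    for _ in range(steps):
        theta = rng.beta(alpha, beta)       # one posterior draw per variant
        a = int(np.argmax(theta))           # play the highest sampled theta
        reward = float(rng.random() < payouts[a])
        alpha[a] += reward                  # posterior: Beta(1 + wins, 1 + losses)
        beta[a] += 1.0 - reward
    return alpha, beta

alpha, beta = thompson_bernoulli([0.2, 0.5, 0.7])
# Posterior means alpha / (alpha + beta) concentrate on the 0.7 variant.
```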
Summary
• k-armed Bandits
• Action Value Methods
• Tracking a non-stationary problem
• Optimistic initial values
• Upper confidence bound action selections
• Gradient Bandit Algorithms
• Contextual Bandits
• Thompson Sampling