The multi-armed bandit problem is characterized by a repeated decision of selecting an arm amongst a set of arms with unknown payoff probabilities; the payoff probabilities may remain constant or change over time. An agent gains information about the payoff probabilities from past actions and aims to learn an optimal arm selection strategy for maximal payoff. The two-armed Bernoulli bandit (TABB) problem is a special class of multi-armed bandit problems where there are exactly two arms with payoff structures that follow Bernoulli distributions. Existing approaches to the TABB primarily rely on perfect recall of past actions to generate estimates for arm payoff probabilities; it is further assumed that the decision maker knows a priori whether arm payoff probabilities can change. We present a different approach based on finite automata which demonstrates that an agent can learn a low regret strategy without knowing whether arm payoff probabilities are static or dynamic and without having perfect recall of past actions. Roughly speaking, the automaton works by maintaining a relative ranking of arms based on payoff probabilities rather than estimating precise payoff probabilities.
Download the full paper here.