## Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks

*Swapna Buccapatnam, Fang Liu, Atilla Eryilmaz, Ness B. Shroff*; 18(216):1−34, 2018.

### Abstract

We study the stochastic multi-armed bandit (MAB) problem in the
presence of side-observations across actions that occur as a
result of an underlying network structure. In our model, a
bipartite graph captures the relationship between actions and a
common set of unknowns such that choosing an action reveals
observations for the unknowns that it is connected to. This
models a common scenario in online social networks where users
respond to their friends' activity, thus providing side
information about each other's preferences. Our contributions
are as follows: 1) We derive an asymptotic lower bound (with
respect to time) on the regret of any uniformly good policy that
achieves the maximum long-term average reward, as a function of
the bipartite network structure. 2) We propose two policies: a
randomized policy, and a policy based on the well-known upper
confidence bound (UCB) policies. Both explore each action at a
rate that is a function of its network position. We show, under
mild assumptions, that these policies achieve the asymptotic
lower bound on the regret up to a multiplicative factor that is
independent of the network structure.
Finally, we use numerical examples on a real-world social
network and a routing example network to demonstrate the
benefits obtained by our policies over other existing policies.
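To make the side-observation model concrete, the following is a minimal illustrative sketch (not the paper's exact policy) of a UCB-style rule in which pulling an arm also reveals sample observations for the unknowns it is connected to in a bipartite graph. All names, the Bernoulli reward assumption, and the adjacency-list encoding are assumptions introduced here for illustration.

```python
import math
import random

def ucb_with_side_observations(means, neighbors, horizon, seed=0):
    """Illustrative UCB-style policy with side-observations.

    means:     true Bernoulli means (unknown to the policy; used only
               to simulate observations)
    neighbors: adjacency list; neighbors[i] lists the unknowns revealed
               when arm i is played (convention: includes i itself)
    Returns (total_reward, counts), where counts[j] is the number of
    observations of unknown j, from plays and side-observations alike.
    """
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n      # observations per unknown (plays + side)
    sums = [0.0] * n      # running sum of observed samples
    total_reward = 0.0
    for t in range(1, horizon + 1):
        # Standard UCB1-style index; unobserved arms are played first.
        def index(i):
            if counts[i] == 0:
                return float('inf')
            return sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
        arm = max(range(n), key=index)
        # Reward from the played arm.
        reward = 1.0 if rng.random() < means[arm] else 0.0
        total_reward += reward
        # Side-observations: every unknown connected to the played arm
        # yields a sample, so statistics accrue faster than plays.
        for j in neighbors[arm]:
            obs = reward if j == arm else (1.0 if rng.random() < means[j] else 0.0)
            counts[j] += 1
            sums[j] += obs
    return total_reward, counts
```

Because each play can update several unknowns at once, well-connected arms need to be played less often for exploration, which is the intuition behind exploring each action at a rate tied to its network position.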
