Loading [MathJax]/jax/output/HTML-CSS/jax.js

Delay and Cooperation in Nonstochastic Bandits

Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour.

Year: 2019, Volume: 20, Issue: 17, Pages: 1−38


Abstract

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order (d+1+KNαd)(TlnK), where αd is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for d=K the regret bound is K1/4T, strictly better than the minimax regret KT for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret TlnK when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.

PDF BibTeX