Contextual Multi-Armed Bandits

Tyler Lu, David Pal, Martin Pal ; JMLR W&CP 9:485-492, 2010.

Abstract

We study contextual multi-armed bandit problems where the context comes from a metric space and the payoff satisfies a Lipschitz condition with respect to the metric. Abstractly, a contextual multi-armed bandit problem models a situation where, in a sequence of independent trials, an online algorithm chooses, based on a given context (side information), an action from a set of possible actions so as to maximize the total payoff of the chosen actions. The payoff depends on both the action chosen and the context. In contrast, context-free multi-armed bandit problems, a focus of much previous research, model situations where no side information is available and the payoff depends only on the action chosen. Our problem is motivated by sponsored web search, where the task is to display ads to a user of an Internet search engine based on her search query so as to maximize the click-through rate (CTR) of the ads displayed. We cast this problem as a contextual multi-armed bandit problem where queries and ads form metric spaces and the payoff function is Lipschitz with respect to both the metrics. For any $\epsilon > 0$ we present an algorithm with regret $O(T^{\frac{a+b+1}{a+b+2} + \epsilon})$ where $a,b$ are the covering dimensions of the query space and the ad space respectively. We prove a lower bound $\Omega(T^{\frac{\tilde{a}+\tilde{b}+1}{\tilde{a}+\tilde{b}+2} \epsilon})$ for the regret of any algorithm where $\tilde{a}, \tilde{b}$ are packing dimensions of the query spaces and the ad space respectively. For finite spaces or convex bounded subsets of Euclidean spaces, this gives an almost matching upper and lower bound.



Home Page

Papers

Submissions

News

Scope

Editorial Board

Announcements

Proceedings

Open Source Software

Search

Login



RSS Feed

Page last modified on Wed Mar 24 15:36 GMT 2010.

Copyright @ JMLR 2000. All rights reserved.