Internship proposal: “Sequential Learning in Path Planning under Delayed and Anonymous Feedback" Supervisors: Alonso SILVA (Nokia Bell Labs France) Dong Quan VU (Nokia Bell Labs France & EDITE UPMC)
Background and topic description:
The multiarmed bandit is an elegant model to solve sequential prediction problems where a learner needs to choose at each time step t = 1,...,T an action (also called arm) in a given set, then he suffers a loss and observes a feedback corresponding to that chosen action. The objective of the learner is to guarantee that the accumulated loss is not much larger than that of the best fixed action in hindsight (that is, to minimize the regret). In this work, we focus on a bandit problem with combinatorial structure, namely Path Planning; where actions correspond to paths on acyclic directed graphs, and the loss on a chosen path is the summation of losses on edges contained in that path.
The classical algorithms for Karms bandits problems in the literature (see e.g., EXP3 in [1]) offer an upper bound of the expected regret by a (sublinearly) polynomial term on the cardinality of the action set. These bounds are inefficient in the case of Path Planning due to the fact that the considering graph typically contains an exponential number of paths. However, by exploiting the specific structure of the graph, we can use the exponential average weights algorithms (e.g., EXP2 in [4, 5]) and bound the regret in terms of the dimension of the representative space (typically a smaller number, e.g., the number of edges). The computational issues of these algorithms remain as an open question that will be attentively studied in this work.
Another aspect that will be considered is the feedback that the learner observes at every time step. Classically, the learner can receive either bandit feedback (he only observes the incurred loss by the chosen action) or either fullinformation feedback (he observes losses of all actions). Recently, several new settings of feedback are proposed by [2, 3] that model realistic problems:
Delayed feedback: The loss of time t will only be observed at time t + d for d > 0 fixed.
Anonymous composite feedback: At time t, the learner only observes the total sum of the loss from time steps t−d+1,t−d,...,t for d > 0 fixed.
Several techniques based on EXP3 algorithm are presented to solve these two new settings. The natural arising question is whether these techniques can be extended into combinatorial bandits, particularly the Path Planning problem, with EXP2 algorithm of [4, 5]. The implementation of these algorithms to conduct computational experiments is another important objective.
Some potential application of this work include advertising on social networks, scheduling data transmission on a network, decision making in recommendation systems, etc.
Further information and application procedure: Candidates should have a strong background in mathematics (preferentially in either sequential learning or game theory/optimization or both), be competent in scientific communication in English and having basic programming experience (Python, R or C++, etc.). Interested candidates are invited to send the following documents to
This email address is being protected from spam bots, you need JavaScript enabled to view it
and
This email address is being protected from spam bots, you need JavaScript enabled to view it
:
• A detailed CV • The most recent academic grades and any other useful information for the application • The name of 12 references willing to provide a recommendation letter
The internship is fully funded and will be mainly based in Nokia Bell Labs France research center (Nozay  Paris region) and LINCS (Paris 13ème).
References [1] Peter Auer, Nicolo CesaBianchi, Yoav Freund, and Robert E Schapire, The nonstochastic multiarmed bandit problem, SIAM journal on computing 32 (2002), no. 1, 48–77. [2] Nicolo CesaBianchi, Claudio Gentile, and Yishay Mansour, Nonstochastic bandits with composite anonymous feedback. [3] Nicolo CesaBianchi, Claudio Gentile, Yishay Mansour, and Alberto Minora, Delay and cooperation in nonstochastic bandits, Journal of machine learning research 49 (2016), 605–622. [4] Nicolo CesaBianchi and Gábor Lugosi, Combinatorial bandits, Journal of Computer and System Sciences 78 (2012), no. 5, 1404–1422. [5] Varsha Dani, Sham M Kakade, and Thomas P Hayes, The price of bandit information for online optimization, Advances in Neural Information Processing Systems, 2008, pp. 345–352.
