
Causal Reinforcement Learning: An Instrumental Variable Approach
In the standard data analysis framework, data is first collected (once and for all), and then data analysis is carried out. With the advancement of digital technology, decision-makers constantly analyze past data and generate new data through the decisions they make. In this paper, we model this as a Markov decision process and show that the dynamic interaction between data generation and data analysis leads to a new type of bias – reinforcement bias – that exacerbates the endogeneity problem in standard data analysis. We propose a class of instrumental variable (IV)-based reinforcement learning (RL) algorithms to correct for the bias and establish their asymptotic properties by incorporating them into a two-timescale stochastic approximation framework. A key contribution of the paper is the development of new techniques that allow for the analysis of the algorithms in general settings where the noise exhibits time dependency. We use these techniques to derive sharper results on finite-time trajectory stability bounds: with a polynomial rate, the entire future trajectory of the iterates from the algorithm falls within a ball that is centered at the true parameter and is shrinking at a (different) polynomial rate. We also use the techniques to provide formulas for inference, which is rarely done for RL algorithms. These formulas highlight how the strength of the IV and the degree of the noise's time dependency affect inference.
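To make the two-timescale idea concrete, here is a minimal sketch of IV estimation via two-timescale stochastic approximation, in the spirit of the algorithms described above. The linear model, the choice of instrument, and the step-size exponents are illustrative assumptions, not the paper's exact algorithm: a fast iterate tracks the IV moment condition at the current parameter, while a slow iterate moves the parameter to zero that moment out.

```python
import numpy as np

# Illustrative setup (assumed, not from the paper):
#   y = theta_true * x + u, where the confounder u makes x endogenous,
#   and z is an instrument: correlated with x, independent of u.
rng = np.random.default_rng(0)
theta_true = 2.0
T = 200_000

theta = 0.0  # slow iterate: the parameter estimate
h = 0.0      # fast iterate: running estimate of the IV moment E[z * (y - theta * x)]

for t in range(1, T + 1):
    z = rng.standard_normal()      # instrument
    u = rng.standard_normal()      # unobserved confounder
    x = z + u                      # endogenous regressor (correlated with u)
    y = theta_true * x + u         # outcome; OLS on (x, y) is biased upward (-> ~2.5)

    beta = 0.5 / t ** 0.6          # fast step size
    alpha = 0.5 / t ** 0.9         # slow step size; alpha / beta -> 0

    h += beta * (z * (y - theta * x) - h)  # track the IV moment at the current theta
    theta += alpha * h                      # drive the moment to zero

print(theta)  # approaches theta_true = 2.0, correcting the endogeneity bias
```

The separation of timescales (the slow step size vanishes relative to the fast one) is what lets the moment estimate `h` equilibrate between parameter updates; the time-dependent-noise analysis in the paper concerns exactly how such coupled iterates behave when the samples are not i.i.d.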