# Correct dimensionality of parameter vector for solving an MRP with linear function approximation?

*Artificial Intelligence* · Asked by soitgoes on August 24, 2021

I’m in the process of trying to learn more about RL by shadowing a course offered collaboratively by UCL and DeepMind that has been made available to the public. I’m most of the way through the course, which for auditors consists of a YouTube playlist, copies of the Jupyter notebooks used for homework assignments (thanks to some former students making them public on GitHub), and reading through Sutton and Barto’s wonderful book Reinforcement Learning: An Introduction (2nd edition).

I’ve gone through a little more than half of the book and the corresponding course material at this point, thankfully with the aid of public solutions for the homework assignments and textbook exercises, which have allowed me to see which parts of my own work I’ve done incorrectly. Unfortunately, I’ve been unable to find such a resource for the last homework assignment, so I’m hoping one of the many capable people here might be able to explain parts of the following question to me.

We are given a simple Markov reward process consisting of two states, with a reward of zero everywhere. When we are in state $s_0$, we always transition to $s_1$. If we are in state $s_1$, there is a probability $p$ (set to $0.1$ by default) of terminating, after which the next episode starts in $s_0$ again. With probability $1 - p$, we transition from $s_1$ back to itself. The discount is $\gamma = 1$ on non-terminal steps.

Instead of a tabular representation, consider a single feature $\phi$, which takes the values $\phi(s_0) = 1$ and $\phi(s_1) = 4$. Now consider using linear function approximation, where we learn a value $\theta$ such that $v_\theta(s) = \theta \times \phi(s) \approx v(s)$, where $v(s)$ is the true value of state $s$.

Suppose $\theta_0 = 1$, and suppose we update this parameter with TD(0) with a step size of $\alpha = 0.1$. What is $\mathbb{E}[\theta_T]$, as a function of $p$, if we step through the MRP until it terminates after the first episode? (Note that $T$ is random.)
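For reference, the standard semi-gradient TD(0) update for linear function approximation (Sutton and Barto, Ch. 9), specialized to a single scalar feature so that $\nabla_\theta v_\theta(s) = \phi(s)$, is:

$$\theta_{t+1} = \theta_t + \alpha \big( R_{t+1} + \gamma\, \theta_t \phi(S_{t+1}) - \theta_t \phi(S_t) \big)\, \phi(S_t),$$

where on the terminating step the target reduces to $R_{t+1}$ alone, since the terminal state's value is defined to be $0$.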

My real point of confusion surrounds $\theta_0$ being given as $1$. My understanding was that the dimensionality of the parameter vector should equal that of the feature vector, which I’ve understood to be $(1, 4)$ and thus two-dimensional. I also don’t grok the idea of evaluating $\mathbb{E}[\theta_T]$ if $\theta$ is a scalar (as an aside, I attempted to simply brute-force simulate the first episode using a scalar parameter of $1$ and, unless I made errors, found the value of $\theta$ not to depend on $p$ whatsoever). If $\theta$ is two-dimensional, would it be initialized as $(1, 0)$, $(0, 1)$, or $(1, 1)$?
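To make the brute-force check above concrete, here is a minimal sketch of that simulation with a scalar $\theta$ and the semi-gradient TD(0) update; the `td0_episode` helper and its structure are my own illustration, not part of the assignment:

```python
import random

def td0_episode(p=0.1, alpha=0.1, theta=1.0):
    """Run scalar TD(0) on the two-state MRP for one episode; return theta at termination."""
    phi = {0: 1.0, 1: 4.0}  # feature values: phi(s0) = 1, phi(s1) = 4
    s = 0
    while True:
        if s == 0:
            s_next, done = 1, False           # s0 always transitions to s1
        else:
            done = random.random() < p        # s1 terminates with probability p
            s_next = 1                        # otherwise it self-loops
        # reward is 0 everywhere; the terminal state's value is 0, gamma = 1 otherwise
        target = 0.0 if done else theta * phi[s_next]
        theta += alpha * (target - theta * phi[s]) * phi[s]
        if done:
            return theta
        s = s_next

random.seed(0)
samples = [td0_episode(p=0.1) for _ in range(1000)]
# the self-loop update is exactly zero (target equals prediction), so every
# sampled theta_T comes out identical, independent of p
print(sum(samples) / len(samples))
```

Note why $p$ drops out: on the $s_1 \to s_1$ self-loop the TD error is $\theta\phi(s_1) - \theta\phi(s_1) = 0$, so only the first step and the terminating step ever change $\theta$.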

Neither the one-dimensional nor the two-dimensional option makes intuitive sense to me, so I hope there’s something clear and obvious that someone might be able to point out. For more context, or should someone simply be interested in the assignment, here is a link to the Jupyter notebook:
https://github.com/chandu-97/ADL_RL/blob/master/RL_cw4_questions.ipynb
