
Difference between AlphaGo's policy network and value network

Data Science Asked on January 31, 2021

I was reading a high level summary about Google’s AlphaGo, and I came across the terms "policy network" and "value network". At a high level, I understand that the policy network is used to suggest moves and the value network is used to, "Reduce the depth of the search tree [and estimate] the winner in each position in place of searching all the way to the end of the game."

These two networks seem redundant to me. What is the policy network doing if it’s not using the value network to prune its policies? It seems pretty clear that the value network is a deep learning neural network; is the policy network just a theoretical abstraction and not an actual neural network? The target variable for the value network seems to be win/loss. Is there a target variable for the policy network; if so, what is it? What is the policy network trying to optimize?

The full PDF of Google’s paper, published in Nature, can be found here.

6 Answers

From what I understand, the difference is in the outputs. Whereas the policy network outputs a probability distribution over the possible moves, the value network returns a real value that can be interpreted as the probability of winning given this board configuration. From there, Monte-Carlo tree search is performed by taking the top K moves from the policy network and then narrowing the search tree again by keeping the moves with the top K value network outputs.
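To make the two outputs concrete, here is a minimal sketch using toy stand-in functions (random numbers in place of real networks; nothing below is AlphaGo's actual architecture):

```python
import numpy as np

BOARD_POINTS = 19 * 19  # 361 candidate moves on a 19x19 board

def policy_net(board_features):
    """Toy stand-in: returns a probability distribution over all moves."""
    logits = np.random.randn(BOARD_POINTS)          # pretend network output
    return np.exp(logits) / np.exp(logits).sum()    # softmax -> sums to 1

def value_net(board_features):
    """Toy stand-in: returns one scalar, e.g. an estimated chance of winning."""
    return float(np.tanh(np.random.randn()))        # single value in (-1, 1)

board = np.zeros((19, 19))       # empty board as a placeholder input
print(policy_net(board).shape)   # (361,)  -> guides breadth: which moves to try
print(value_net(board))          # scalar  -> guides depth: how good is this position
```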

Feel free to correct me if I am wrong.

Answered by Joonatan Samuel on January 31, 2021

I think the OP is confusing AlphaGo with alpha-beta. In alpha-beta you would indeed use something like the policy network to help with pruning, but not here: there is no pruning, because the algorithm relies on Monte-Carlo tree search (MCTS).

Anyone who thinks my answer is too long might skip to the summary section, where I state why the two networks are not redundant.

In the following example, I'll do some simplification to make my ideas easier to understand.

Example:

Imagine you have a position with two legal moves. The first move is dead lost for you, while the second move gives you a winning advantage.

  • First move: forced loss for you
  • Second move: forced win for you

Evaluation network

Let's assume the evaluation network Google gives you is perfect. It can evaluate any leaf position in our example perfectly. We won't change our value network in the example.

To simplify our example, let's assume our value network gives:

  • -1000 for any leaf position which is a loss for you
  • +1000 for any leaf position which is a win for you

Policy network

Let's assume Google gives you two policy networks. The prior probabilities they generate for our position are:

  • Policy 1: 0.9 for move 1 and 0.1 for move 2
  • Policy 2: 0.2 for move 1 and 0.8 for move 2.

Note that our first policy network gives incorrect prior probabilities for our example. It assigns 0.9 to move 1, which is a losing move. This is fine, because not even Google could train a perfect policy network.

Playing with the first policy network

AlphaGo needs to generate a simulation with Monte-Carlo, and it needs to choose move 1 or move 2. Now, AlphaGo draws a uniformly distributed random number, and it will pick (a toy sketch of this sampling step follows the list):

  • Move 1 if the random number is <= 0.9
  • Move 2 if the random number is > 0.9
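Here is that simplified sampling step in code (the real selection rule, the argmax over Q + u, appears further down in this answer; this only illustrates drawing a move according to the prior):

```python
import random

prior = {1: 0.9, 2: 0.1}   # policy network 1: move 1 -> 0.9, move 2 -> 0.1

def sample_move(prior):
    r = random.random()                # uniform in [0, 1)
    return 1 if r <= prior[1] else 2   # move 1 if r <= 0.9, else move 2

counts = {1: 0, 2: 0}
for _ in range(10_000):
    counts[sample_move(prior)] += 1
print(counts)   # roughly {1: 9000, 2: 1000}: the losing move is tried far more often
```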

So AlphaGo is much more likely to pick the losing move in our very first simulation. In that simulation, we will also use the value network to get a score for the leaf position. In the paper, the leaf evaluation is:

V(s_L) = (1 − λ) · v_θ(s_L) + λ · z_L   (where v_θ(s_L) is the value network's output for the leaf and z_L is the fast-rollout result)

This value would be -1000, because this simulation would lead to a loss.
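In code, with the toy numbers of this example and λ = 0.5 for illustration (treating the rollout result as a loss too):

```python
lam = 0.5            # mixing parameter lambda
v_theta = -1000.0    # value network's evaluation of the leaf
z = -1000.0          # outcome of the fast rollout from the leaf
V_leaf = (1 - lam) * v_theta + lam * z
print(V_leaf)        # -1000.0: this simulation scores as a clear loss
```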

Now, AlphaGo needs to generate the second simulation. Again, the first move is much more likely to be picked. But eventually, the second move will be picked, because:

  • Our prior probability for the second move is 0.1, not zero
  • AlphaGo is encouraged to try moves which haven't been explored much. In the paper, this is done by this equation:

u(s, a) ∝ P(s, a) / (1 + N(s, a))

Note that N(s, a) is the number of times the move has been searched, and it is in the denominator. The more our first move is searched, the smaller its u term becomes. Thus, the probability of selecting our second move improves, because AlphaGo actually picks a move by this equation:

a_t = argmax_a ( Q(s_t, a) + u(s_t, a) )

This is the key equation. Please look at it carefully:

  • It has a term P for the prior probability (given by the policy network)
  • It has a term Q for the evaluation scores (given by the value network)

Now, we know our second move will eventually be chosen. When that happens, the value network returns +1000. This increases Q, which makes the second move much more likely to be chosen in later simulations.

Given enough simulations, the number of times the second move is chosen for simulation should be more than the number of times the first move is chosen.
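Here is a minimal sketch of this toy search. The exploration constant 5.0 and the bookkeeping are illustrative simplifications of the paper's rule (u ∝ P / (1 + N)), not its exact implementation:

```python
import math

prior = {1: 0.9, 2: 0.1}         # P(s, a) from policy network 1
value = {1: -1000.0, 2: 1000.0}  # value network's score for the resulting position
N = {1: 0, 2: 0}                 # visit counts
W = {1: 0.0, 2: 0.0}             # accumulated backed-up value per move

def select_move():
    best, best_score = None, -math.inf
    for a in (1, 2):
        Q = W[a] / N[a] if N[a] > 0 else 0.0   # average backed-up value
        u = 5.0 * prior[a] / (1 + N[a])        # exploration bonus, u ∝ P / (1 + N)
        if Q + u > best_score:
            best, best_score = a, Q + u
    return best

for _ in range(200):             # 200 simulations
    a = select_move()
    N[a] += 1
    W[a] += value[a]             # back up the value network's evaluation

print(N)                         # {1: 1, 2: 199}: move 2 gets far more visits
print(max(N, key=N.get))         # the move actually played: the most visited one
```

After the first (losing) simulation of move 1, the Q term drags move 1 down and move 2 takes over, which is exactly the behaviour described above.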

Finally, the move that AlphaGo decides to make is (quoted from the paper):

Once the search is complete, the algorithm chooses the most visited move from the root position.

Playing with the second policy network

Our second policy network will need fewer iterations to pick move 2, because the prior probability it gives is correct in the first place.

Remarks

Everything here is very similar to Bayesian analysis. We start off with some prior probability (given by the policy network), then we generate data to move the probability distribution (given by the value network).

Summary

  • The policy network is used to generate prior probabilities that guide which move the Monte-Carlo search should pick.
  • The value network is used to generate data to validate the policy network. If the policy network is bad, AlphaGo would need more computing resources to converge (if it ever does).
  • You can think of it like Bayesian analysis.

Answered by SmallChess on January 31, 2021

In brief, each net has a different purpose, as you mentioned:

  • The value network was used at the leaf nodes to reduce the depth of the tree search.
  • The policy network was used to reduce the breadth of the search from a node (guiding towards promising immediate actions).

In general, you can use value-function methods to find an optimal policy, or directly search in the policy space to optimize a parametrized policy function (of course there are pros and cons to each). You can use function approximators (e.g., deep nets) in either case. Since you seem mainly confused about the policy net, I will focus my answer on that.

The policy net was first:

trained to predict the moves a human would most likely play given a board state (so the input is a board state and the output is a histogram giving the probability of each action in that state). The net approximates the probability function underlying the mapping from states to actions. After all, it is reasonable to start building your policy from the available data. After supervised training on expert moves, the policy net could play the game adequately (although far from a master's level). Simply put, it attempts to capture the general pattern of action selection of professional players.
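A minimal sketch of this supervised step, assuming a toy fully connected net and fake "expert" data (AlphaGo's real policy net was a deep convolutional network trained on millions of expert positions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19 * 19
policy_net = nn.Sequential(nn.Linear(BOARD, 256), nn.ReLU(), nn.Linear(256, BOARD))
opt = torch.optim.SGD(policy_net.parameters(), lr=1e-2)

# Fake expert data just to make the sketch runnable: board features and the
# index of the move the expert played in each position.
states = torch.randn(32, BOARD)
expert_moves = torch.randint(0, BOARD, (32,))

logits = policy_net(states)                     # one score per candidate move
loss = F.cross_entropy(logits, expert_moves)    # match the expert's choices
opt.zero_grad(); loss.backward(); opt.step()

probs = F.softmax(policy_net(states[:1]), dim=-1)   # the "histogram" over moves
print(probs.shape, probs.sum().item())              # torch.Size([1, 361]), ~1.0
```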

Then,

it was trained on games against itself (self-play), in order to improve the previously learned policy. This time its weights were updated using the REINFORCE algorithm, which updates the net parameters towards maximizing the expected reward. Eventually you have a net that not only selects actions like a professional player, but also selects them with winning the game in mind (however, it cannot plan!).
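A minimal sketch of the REINFORCE update, again with a toy net and a fake self-play game (the positions, moves, and outcome here are placeholders):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

BOARD = 19 * 19
policy_net = nn.Sequential(nn.Linear(BOARD, 256), nn.ReLU(), nn.Linear(BOARD, BOARD)[0:0] or nn.Linear(256, BOARD))
# (the line above simply builds Linear(256, BOARD) as the output layer)
opt = torch.optim.SGD(policy_net.parameters(), lr=1e-2)

# One fake self-play game: positions visited and moves sampled from the policy.
states = torch.randn(16, BOARD)
dist = Categorical(logits=policy_net(states))
moves = dist.sample()
game_outcome = 1.0          # +1 if this player won the game, -1 if it lost

# REINFORCE: raise log pi(move | state) for moves in won games, lower it for
# moves in lost games. No planning or search is involved in this step.
loss = -(game_outcome * dist.log_prob(moves)).mean()
opt.zero_grad(); loss.backward(); opt.step()
```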

After this step, they approximated the value function of a slightly noisier version of the learned policy by regression (the input is the board state and the target is the result of the game). This network is then used for the leaf-node evaluation.
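A minimal sketch of that regression step, with a toy net and fake self-play outcomes (again not AlphaGo's actual architecture or data):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19 * 19
value_net = nn.Sequential(nn.Linear(BOARD, 256), nn.ReLU(),
                          nn.Linear(256, 1), nn.Tanh())
opt = torch.optim.SGD(value_net.parameters(), lr=1e-2)

states = torch.randn(64, BOARD)                           # sampled self-play positions
outcomes = torch.randint(0, 2, (64, 1)).float() * 2 - 1   # game results in {-1, +1}

pred = value_net(states)                 # predicted outcome per position
loss = F.mse_loss(pred, outcomes)        # regression target: the actual result
opt.zero_grad(); loss.backward(); opt.step()

print(value_net(states[:1]).item())      # the single scalar used at leaf nodes
```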

Conceptually speaking, the policy net gives you a probability distribution over actions, but this doesn't guarantee that you will end up in a state that is good for winning the game. AlphaGo had some "blind spots": during the tournament it made some really bad moves, but also one exceptional move that a human could never have thought of.

Finally, you can use your planning algorithm (MCTS) in combination with these nets. Why take all these steps? Briefly, plain MCTS without any "intuition" would have failed.

Answered by Constantinos on January 31, 2021

Here is my concise thought process in understanding the two different networks.

First of all, the goal is to find an optimal solution (or very near-optimal) without using an exhaustive search, which is definitely challenging.

Per position or state, there are N possible moves, and each move leads to its own subtree of depth D in a full search tree. It is theoretically possible to walk through all paths and find the optimal solution(s). However, we don't want to do a full search.

Now we got two separate questions for developing an approximation approach.

Q1. How can we skip or disregard some of the N moves per position? (i.e., breadth reduction)

Q2. How can we stop at an intermediate depth in the search tree rather than walking through to the end of the game, without failing to find an optimal solution? (i.e., depth reduction)

The policy network is mainly designed to filter out unpromising moves among the N candidates, yet without failing to find an optimal solution. This network initially relies on human expert moves (supervised learning, SL) and is improved by reinforcement learning (RL) later.

The value network is mainly designed for finding the winning probability without a full search.
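As a rough sketch, here is how the two reductions might plug into a toy search. Every helper below is a hypothetical stand-in, not AlphaGo's actual components:

```python
import random

K = 3          # Q1, breadth reduction: only expand the K most promising moves
MAX_DEPTH = 3  # Q2, depth reduction: stop here and ask the "value net" instead

def legal_moves(state):    return list(range(8))         # toy: 8 moves everywhere
def play(state, move):     return state + [move]         # toy: append move to history
def policy_priors(state, moves):                         # toy: random priors
    w = [random.random() for _ in moves]
    return {m: wi / sum(w) for m, wi in zip(moves, w)}
def value_estimate(state): return random.uniform(-1, 1)  # toy: random evaluation

def search(state, depth=0):
    if depth == MAX_DEPTH:
        return value_estimate(state)                           # depth reduction
    priors = policy_priors(state, legal_moves(state))
    top_k = sorted(priors, key=priors.get, reverse=True)[:K]   # breadth reduction
    # Negamax-style: the best move for us is the worst for the opponent.
    return max(-search(play(state, m), depth + 1) for m in top_k)

print(search([]))   # explores K**MAX_DEPTH leaves instead of 8**MAX_DEPTH
```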

These two networks share the common goal of finding an optimal solution. However, in each strategic choice of move, each network plays a different role.

I hope this helps. I know it is still at a high level.

Answered by Wind P. on January 31, 2021

Policy network: the network that learns to produce a definite output (an action) for a particular input (a game state) is known as the policy network.

Value network: the value network assigns a value/score to a state of the game by estimating the expected cumulative reward from the current state s. Every state goes through the value network. States from which more reward can be obtained are assigned higher values.

For a better understanding with animations, see: Policy Networks vs Value Networks in Reinforcement Learning.


Answered by SAGAR SHARMA on January 31, 2021

I just have a very shallow understanding, but I feel that in an ideal setting, where you could train a network on every possible case, one network would be enough. That isn't possible with limited resources, though. So the policy network is used to reduce breadth and the value network to reduce depth. The result is probabilistic, not 100% perfect.

Answered by zyfo2 on January 31, 2021
