Advantage Actor-Critic model implementation with TensorFlow.js

Asked by Sergiu Ionescu on January 28, 2021

I am trying to implement an Actor-Critic method that controls an RC car. For this I have implemented a simulated environment and actor and critic TensorFlow.js models.

My intention is to train a model to navigate an environment without colliding with various obstacles.

For this I have the following:

State (continuous):

  • the sensor distances (left, middle, right): [0..1, 0..1, 0..1]

Action (discrete):

  • 4 possible actions (move forward, move back, turn left, turn right)

Reward (cumulative):

  • moving forward is encouraged
  • being close to an obstacle is penalized
  • colliding with an obstacle is penalized (an illustrative shaping sketch follows this list)

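A purely illustrative sketch of this reward shaping (the actual environment code is not included here; all thresholds and weights are made up):

function computeReward(sensors, action, collided) {
  // sensors: [left, middle, right] distances normalized to [0..1]
  let reward = 0;
  if (action === 'forward') reward += 1;             // moving forward is encouraged
  const closest = Math.min(...sensors);
  if (closest < 0.2) reward -= (0.2 - closest) * 5;  // being close to an obstacle is penalized
  if (collided) reward -= 10;                        // colliding with an obstacle is penalized
  return reward;
}
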
The structure of the models:

buildActor() {
      const model = tf.sequential();
      model.add(tf.layers.inputLayer({inputShape: [this.stateSize],}));

      model.add(tf.layers.dense({
        units: parseInt(this.config.hiddenUnits),
        activation: 'relu',
        kernelInitializer: 'glorotUniform',
      }));

      model.add(tf.layers.dense({
        units: parseInt(this.config.hiddenUnits/2),
        activation: 'relu',
        kernelInitializer: 'glorotUniform',
      }));

      model.add(tf.layers.dense({
        units: this.actionSize,
        activation: 'softmax',
        kernelInitializer: 'glorotUniform',
      }));

      this.compile(model, this.actorLearningRate);

      return model;
    }
buildCritic() {
      const model = tf.sequential();

      model.add(tf.layers.inputLayer({inputShape: [this.stateSize],}));

      model.add(tf.layers.dense({
        units: parseInt(this.config.hiddenUnits),
        activation: 'relu',
        kernelInitializer: 'glorotUniform',
      }));

      model.add(tf.layers.dense({
        units: parseInt(this.config.hiddenUnits/2),
        activation: 'relu',
        kernelInitializer: 'glorotUniform',
      }));

      model.add(tf.layers.dense({
        units: this.valueSize,
        activation: 'linear',
        kernelInitializer: 'glorotUniform',
      }));

      this.compile(model, this.criticLearningRate);

      return model;
    }

The models are compiled with an Adam optimizer and Huber loss:

compile(model, learningRate) {
      model.compile({
        optimizer: tf.train.adam(learningRate),
        loss: tf.losses.huberLoss,
      });
    }
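
The builders above reference a few fields on the agent; for the state and action space described earlier they would be along these lines (the hidden-layer width and learning rates are assumed values, not taken from the project):

// Assumed agent configuration matching the environment description.
this.stateSize = 3;               // three normalized distance sensors
this.actionSize = 4;              // forward, back, turn left, turn right
this.valueSize = 1;               // the critic outputs a single state value
this.config = {hiddenUnits: 16};  // assumed hidden-layer width
this.actorLearningRate = 0.001;   // assumed learning rates
this.criticLearningRate = 0.005;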

Training:

trainModel(state, action, reward, nextState) {
      let advantages = new Array(this.actionSize).fill(0);

      let normalizedState = normalizer.normalizeFeatures(state);
      let tfState = tf.tensor2d(normalizedState, [1, state.length]);
      let normalizedNextState = normalizer.normalizeFeatures(nextState);
      let tfNextState = tf.tensor2d(normalizedNextState, [1, nextState.length]);

      // dataSync() returns a typed array, so take the scalar value explicitly.
      let predictedCurrentStateValue = this.critic.predict(tfState).dataSync()[0];
      let predictedNextStateValue = this.critic.predict(tfNextState).dataSync()[0];

      let target = reward + this.discountFactor * predictedNextStateValue;
      let advantage = target - predictedCurrentStateValue;
      advantages[action] = advantage;
      // console.log(normalizedState, normalizedNextState, action, target, advantages);

      this.actor.fit(tfState, tf.tensor([advantages]), {
        epochs: 1,
      }).then(info => {
          this.latestActorLoss = info.history.loss[0];
          this.actorLosses.push(this.latestActorLoss);
        }
      );

      this.critic.fit(tfState, tf.tensor([target]), {
        epochs: 1,
      }).then(info => {
          this.latestCriticLoss = info.history.loss[0];
          this.criticLosses.push(this.latestCriticLoss);
        }
      );

      this.advantages.push(advantage);
      pushToEvolutionChart(this.epoch, this.latestActorLoss, this.latestCriticLoss, advantage);
      this.epoch++;
    }
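
trainModel() consumes a single (state, action, reward, nextState) transition, so it is driven by a per-step loop roughly like the one below (agent.act() and env.step() are hypothetical names, not the project's actual API):

// One simulation step: act, observe the transition, then train on it immediately.
const action = agent.act(state);                     // sample an action from the actor's softmax output
const {nextState, reward} = env.step(action);        // advance the simulated environment
agent.trainModel(state, action, reward, nextState);  // online, one-transition update
state = nextState;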

You can give the simulation a spin at https://sergiuionescu.github.io/esp32-auto-car/sym/sym.html.

I found that some behaviors are being picked up: the model learns to prioritize moving forward after a few episodes, but then it hits a wall, reprioritizes spinning, and seems to completely ‘forget’ that moving forward was ever prioritized.

I’ve been trying to follow https://keras.io/examples/rl/actor_critic_cartpole/ to a certain extent, but I have not found an equivalent of the way back-propagation is handled there (GradientTape).

Is it possible to perform training similar to the Keras example in TensorFlow.js?
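
(For reference, a GradientTape-style update might be expressible with optimizer.minimize(), which computes the gradients of a returned scalar loss with respect to the trainable variables and applies them. The sketch below is only an assumption about how that could look here; the actorOptimizer/criticOptimizer fields and the loss shaping are not code from this project.)

trainStep(tfState, actionIndex, target, advantage) {
  // Hypothetical custom training step; assumes this.actorOptimizer and
  // this.criticOptimizer were created with tf.train.adam(...).
  tf.tidy(() => {
    // Actor: policy-gradient loss -log(pi(a|s)) * advantage for the action taken.
    this.actorOptimizer.minimize(() => {
      const probs = this.actor.apply(tfState);                 // shape [1, actionSize]
      const mask = tf.oneHot([actionIndex], this.actionSize);  // selects the taken action
      const logProb = tf.sum(tf.mul(tf.log(tf.add(probs, 1e-8)), mask));
      return logProb.mul(-advantage);
    });

    // Critic: regress the state value towards the TD target.
    this.criticOptimizer.minimize(() => {
      const value = this.critic.apply(tfState);                // shape [1, 1]
      return tf.losses.huberLoss(tf.tensor2d([[target]]), value);
    });
  });
}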

The theory I’ve gone through on Actor-Critic says that the critic should estimate the reward yet to be obtained over the rest of the episode, but I am training the critic with
reward + this.discountFactor * predictedNextStateValue, where reward is the cumulative reward up to the current step.
Should I keep track of a maximum total reward from previous episodes and subtract my reward from that instead?
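
For reference, the standard one-step (TD(0)) target and advantage are usually written with the per-step reward r_{t+1} rather than a running total:

    V(s_t) ≈ r_{t+1} + γ * V(s_{t+1})
    A(s_t, a_t) = r_{t+1} + γ * V(s_{t+1}) - V(s_t)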

When I am training the actor I am generating a zero-filled advantages tensor:

let advantages = new Array(this.actionSize).fill(0);
let target = reward + this.discountFactor * predictedNextStateValue;
let advantage = target - predictedCurrentStateValue;
advantages[action] = advantage;

All actions other than the one taken receive a 0 advantage. Could this discourage previous actions that were proven beneficial?
Should I average out the advantages per state and action?

Thanks for having the patience to go through all of this.

One Answer

After tinkering a bit more with my experiment, I got it to consistently manifest the intended behavior after around 200 episodes.

Changes to the model itself were minimal: I replaced the loss function on the actor with tf.losses.softmaxCrossEntropy.
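
One way to express that change with the shared compile() helper from the question is to pass the loss per model (this is a sketch, not the exact code that was used):

compile(model, learningRate, loss) {
  model.compile({
    optimizer: tf.train.adam(learningRate),
    loss: loss,
  });
}

// in buildActor(): cross-entropy replaces the Huber loss
this.compile(model, this.actorLearningRate, tf.losses.softmaxCrossEntropy);
// in buildCritic(): unchanged
this.compile(model, this.criticLearningRate, tf.losses.huberLoss);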

Some changes to the training environment seemed to have a significant impact and improved training:

  • Ending the episode after the reward reaches a minimum threshold - this prevents the models from being polluted when the car is stuck in a corner or flat against a wall (an illustrative check follows this list).
  • Making sure that the reward used in training is in line with the action that produced it - model training in my case is asynchronous to the physics of the environment: I am sampling the simulation and providing inputs, but the simulation is not interrupted by model-related processing.
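
An illustrative version of the first change, with a made-up threshold and helper name:

// Stop the episode once the cumulative reward falls below a cutoff, so a car
// stuck against a wall does not keep feeding bad samples into training.
const MIN_EPISODE_REWARD = -50;   // assumed value
if (episodeReward < MIN_EPISODE_REWARD) {
  this.endEpisode();              // hypothetical helper that resets the simulation
}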

Answered by Sergiu Ionescu on January 28, 2021
