# Why does the error of my LSTM not decrease after 10 epochs?

Artificial Intelligence · Asked by K. Do on December 13, 2020

Despite the problem being very simple, I was wondering why an LSTM network was not able to converge to a decent solution.

```python
import numpy as np
import keras

# Toy task: learn the identity mapping y = x on random scalars
X_train = np.random.rand(1000)
y_train = X_train
X_train = X_train.reshape((len(X_train), 1, 1))  # (samples, timesteps, features)

model = keras.models.Sequential()
model.add(keras.layers.LSTM(1))  # single-unit LSTM producing one output per sequence

optimizer = keras.optimizers.SGD(lr=1e-1)

model.build(input_shape=(None, 1, 1))
model.compile(loss=keras.losses.mean_squared_error, optimizer=optimizer, metrics=['mae'])
history = model.fit(X_train, y_train, batch_size=16, epochs=100)
```


After 10 epochs, the network seems to have reached its best solution (around 1e-4 RMSE) and is not able to improve the results any further.

A simple Flatten + Dense network with similar parameters is, however, able to achieve an RMSE of about 1e-13.
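
For reference, a minimal sketch of such a Flatten + Dense baseline (the layer sizes here are illustrative, not the exact network) looks like this:

```python
# Illustrative Flatten + Dense baseline (layer sizes are assumed, not the exact original network)
dense_model = keras.models.Sequential()
dense_model.add(keras.layers.Flatten(input_shape=(1, 1)))  # collapse the (1, 1) window to one feature
dense_model.add(keras.layers.Dense(1))                     # linear output, no activation by default

dense_model.compile(loss='mse', optimizer=keras.optimizers.SGD(lr=1e-1), metrics=['mae'])
dense_model.fit(X_train, y_train, batch_size=16, epochs=100)
```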

I’m surprised the LSTM cell does not simply let the value through. Is there something I’m missing in my parameters? Is an LSTM only good for classification problems?

I think there are some problems with your approach.

Firstly, looking at the Keras documentation, LSTM expects an input of shape (batch_size, timesteps, input_dim). You're passing an input of shape (1000, 1, 1), which means you are feeding "sequences" of a single timestep each.

RNNs have been proposed to capture temporal dependencies, but it's impossible to capture such dependencies when the length of each series is 1 and the numbers are randomly generated. If you want a more realistic scenario, I would suggest generating a sine wave, since it has a smooth periodic oscillation. Then increase the number of timesteps beyond 1 and test the model by predicting the following timesteps, as in the sketch below.
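
As a rough sketch of that setup (the window length, layer sizes and optimizer below are arbitrary choices for illustration, not a prescribed configuration):

```python
import numpy as np
import keras

# Smooth periodic signal instead of random numbers
t = np.arange(0, 100, 0.1)
wave = np.sin(t)

# Sliding windows: use `window` past values to predict the next one
window = 10  # arbitrary window length
X = np.array([wave[i:i + window] for i in range(len(wave) - window)])
y = wave[window:]
X = X.reshape((len(X), window, 1))  # (samples, timesteps, features)

model = keras.models.Sequential()
model.add(keras.layers.LSTM(16, input_shape=(window, 1)))  # the LSTM now sees real temporal context
model.add(keras.layers.Dense(1))

model.compile(loss='mse', optimizer='adam', metrics=['mae'])
model.fit(X, y, batch_size=16, epochs=20)
```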

For the second part: if you think about a plain RNN (I will explain it for a simple RNN, but you can imagine a similar flow for an LSTM) and a Dense layer applied to a single timestep, there are not many differences. The Dense layer computes $Y = f(XW + b)$, where $X$ is the input, $W$ is the weight matrix, $b$ is the bias and $f$ is the activation function. The RNN computes $Y = f(XW_1 + W_2 h_0 + b)$; since this is the first timestep, $h_0$ is $0$, so it reduces to $Y = f(XW_1 + b)$, which is identical to the Dense layer. I suspect the difference in results is caused by the activation functions: by default the Dense layer has no activation function, while the LSTM uses tanh and sigmoid.
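
One way to check that suspicion (the script below is only an illustrative sketch, not code from the question): train a Dense layer, a SimpleRNN with a linear activation, and a SimpleRNN with the default tanh on the same single-timestep data and compare the final losses. The linear RNN should behave essentially like the Dense layer.

```python
import numpy as np
import keras

X = np.random.rand(1000, 1, 1)  # 1000 "sequences" of a single timestep
y = X.reshape(len(X), 1)

def final_loss(model):
    # Train each model the same way and return the last training loss
    model.compile(loss='mse', optimizer=keras.optimizers.SGD(lr=1e-1))
    return model.fit(X, y, batch_size=16, epochs=100, verbose=0).history['loss'][-1]

# Dense layer: linear activation by default
dense = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(1, 1)),
    keras.layers.Dense(1)
])

# SimpleRNN with a linear activation: on a single timestep (h_0 = 0) it reduces to the Dense case
rnn_linear = keras.models.Sequential([
    keras.layers.SimpleRNN(1, activation=None, input_shape=(1, 1))
])

# SimpleRNN with the default tanh activation
rnn_tanh = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape=(1, 1))
])

for name, m in [('dense', dense), ('rnn (linear)', rnn_linear), ('rnn (tanh)', rnn_tanh)]:
    print(name, final_loss(m))
```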

Answered by razvanc92 on December 13, 2020
