AnswerBun.com

Immediate NaN in loss function with custom activation without extreme batch size: how to prevent exploding gradients?

Cross Validated Asked by Rain on December 8, 2020

I am using a custom activation function. With SGD as the optimiser, the loss becomes NaN at some stage during training unless the batch size is set to an excessively high value. With Adam as the optimiser, this happens immediately, regardless of batch size.

A reduced version of the code used to test this:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

class CustomActivation(tf.keras.layers.Layer):
    def __init__(self):
        super(CustomActivation, self).__init__()

    def call(self, inputs):
        x, y = inputs
        # sigmoid applied to 0.5 * x * (1 + exp(2 * x * y))
        return 1.0 / (1.0 + tf.math.exp(-1 * (0.5 * x * (1 + tf.math.exp(2 * x * y)))))
    
    
# load data
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# divide to be between 0 and 1
train_images = train_images / 255.0
test_images = test_images / 255.0

# secondary inputs: one zero per sample
stuff = np.zeros((60000, 1))
stuff2 = np.zeros((10000, 1))

# declare inputs
input1 = keras.Input(shape=(28,28,))
input2 = keras.Input(shape=(1,))

#flatten
flat1 = layers.Flatten()(input1)

# weight and output layers
primary_1 = layers.Dense(10,)(flat1)
secondary_1 = layers.Dense(10,)(input2)
out = CustomActivation()([primary_1,secondary_1])

# declare model
model = keras.Model(inputs=[input1,input2],outputs=out)
model.summary()

# train and test
opt = keras.optimizers.SGD(learning_rate=0.05)

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=opt,
    metrics=["accuracy"],
)


model.fit([train_images,stuff], train_labels, batch_size=20480, epochs=10)

test_loss, test_acc = model.evaluate([test_images,stuff2], test_labels, verbose=2)

print('\nTest accuracy:', test_acc)

If the secondary input is set to any value other than zeros, then even a batch size of 20480 is too small.
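This is consistent with the form of the activation: when the secondary input $y$ is zero, $\exp(2xy)=1$ and the activation collapses to a plain sigmoid of $x$, so nothing overflows. A quick NumPy check of that identity (the NumPy re-implementation below is mine, not the model code):

```python
import numpy as np

def custom_activation(x, y):
    # the question's activation, rewritten in NumPy
    return 1.0 / (1.0 + np.exp(-0.5 * x * (1.0 + np.exp(2.0 * x * y))))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
# with y = 0 the activation is exactly sigmoid(x)
print(np.allclose(custom_activation(x, 0.0), sigmoid(x)))  # True
```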

As this is the fashion MNIST dataset, there are no NaN values in the input. Running such a large batch size in a more complicated network is infeasible, and including l2 or l1 regularisation does not make it possible to reduce the batch size.

It seems like a classic case of exploding gradients; the partial derivatives of the activation for $x$ and $y$ are (calculated analytically):
$$
\frac{dA}{dx}=\frac{1}{2}+\left(\frac{1}{2}+xy\right)\exp(2xy)
$$

and
$$
\frac{dA}{dy}=x^{2}\exp(2xy)
$$
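Plugging in even modest values shows how quickly these terms overflow. A minimal NumPy sketch (the values of x, y, and the update step below are hypothetical, chosen only to illustrate the mechanism):

```python
import numpy as np

# float32 exp() overflows for arguments above ~88.7, so the gradient
# term (1/2 + x*y) * exp(2*x*y) is already infinite at x = y = 7.
x = np.float32(7.0)
y = np.float32(7.0)
with np.errstate(over="ignore"):
    grad = (np.float32(0.5) + x * y) * np.exp(np.float32(2.0) * x * y)
print(grad)  # inf

# One SGD step with an infinite gradient drives the weight to -inf;
# a later forward pass multiplying that weight by a zero input then
# computes 0 * inf = nan, which propagates through the loss.
w = np.float32(1.0) - np.float32(0.05) * grad   # -inf
with np.errstate(invalid="ignore"):
    out = w * np.float32(0.0)                   # nan
print(out)
```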

However, it seems that as of TensorFlow 2, gradient clipping is deprecated (at minimum, it is no longer referenced in the documentation), so what can be done to avoid running into NaN so quickly in training?
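As far as I can tell, gradient clipping itself has not disappeared from Keras: the TF2 optimizers still accept `clipnorm` and `clipvalue` constructor arguments (e.g. `keras.optimizers.SGD(learning_rate=0.05, clipnorm=1.0)`). The underlying clip-by-global-norm operation is simple enough to sketch in NumPy; the function name and threshold below are mine, not TensorFlow API:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so that their joint L2 norm
    is at most max_norm (the clip-by-global-norm scheme)."""
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in grads))
    if global_norm > max_norm:
        grads = [g * (max_norm / global_norm) for g in grads]
    return grads, global_norm

# Hypothetical gradients with global norm sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm(grads, 1.0)
print(norm)  # 13.0
```

Note that clipping only bounds the update step: a forward pass that already produces inf or NaN cannot be rescued this way, so bounding the exponent itself (for instance with `tf.clip_by_value` around `2*x*y` inside the activation) may still be necessary.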
