Expected performance of training tf.keras.Sequential model with model.fit, model.fit_generator and model.train_on_batch

Question

I am using Keras with the TensorFlow backend to train a simple 1D CNN that detects specific events in sensor data. The data, tens of millions of samples, easily fits in RAM as a 1D float array, but storing it as an N x inputDim array that can be passed to model.fit takes a huge amount of memory. I can use model.fit_generator or model.train_on_batch to generate the required mini batches on the fly, but for some reason I am observing a huge performance gap between model.fit and the other two, even though everything is stored in memory and mini batch generation is fast, as it basically only consists of reshaping the data. I'm therefore wondering whether I am doing something terribly wrong or whether this kind of performance gap is to be expected. I am using the CPU version of TensorFlow 2.0 on a 3.2 GHz Intel Core i7 processor (4 cores with multithreading support) with Python 3.6.3 on macOS Mojave.
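To put rough numbers on the memory gap, here is a back-of-the-envelope estimate. It borrows N = 1049344 and inputDim = 104 from the dummy script further down and assumes 8-byte (float64) or 4-byte (float32) elements; the real data set is larger, so the figures are illustrative only.

```python
# Rough memory estimate: 1D signal versus the windowed N x inputDim array.
# N and input_dim come from the dummy script; the element sizes are assumptions.
N = 1049344
input_dim = 104

flat_bytes         = N * 8               # 1D signal stored as float64
windowed_f64_bytes = N * input_dim * 8   # N x inputDim stored as float64
windowed_f32_bytes = N * input_dim * 4   # N x inputDim stored as float32

for label, size in [("1D signal, float64", flat_bytes),
                    ("windowed, float64", windowed_f64_bytes),
                    ("windowed, float32", windowed_f32_bytes)]:
    print(f"{label:20s} {size / 1e6:10.1f} MB")
```

For the same element type, the windowed array is inputDim times larger than the 1D signal, which is what makes it impractical for tens of millions of samples.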

In short, I created a dummy Python script to reproduce the issue. With a batch size of 64, running 10 epochs takes 407 seconds with model.fit, 1852 seconds with model.fit_generator, and 1985 seconds with model.train_on_batch. The CPU loads are ~220%, ~130%, and ~120% respectively. It seems especially odd that model.fit_generator and model.train_on_batch are practically on par, since model.fit_generator should be able to parallelise mini batch creation while model.train_on_batch definitely does not. That is, model.fit (with its huge memory requirements) beats the other candidates, whose memory requirements are easily manageable, by a factor of more than four. As expected, CPU loads increase and total training times decrease as the batch size grows, but model.fit remains the fastest by a factor of roughly two all the way up to a batch size of 8096. In that case, model.fit takes 99 seconds to run 10 epochs with a CPU load of ~860% (or pretty much everything I have got), model.fit_generator takes 179 seconds with a CPU load of ~700%, and model.train_on_batch takes 198 seconds with a CPU load of ~680%.
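The slowdown factors implied by these timings can be checked with a few lines of arithmetic. The numbers are the ones quoted above; the dictionary layout and the helper name slowdowns are just for illustration.

```python
# Wall-clock times in seconds for 10 epochs, as quoted in the question.
timings = {
    64:   {"fit": 407, "fit_generator": 1852, "train_on_batch": 1985},
    8096: {"fit": 99,  "fit_generator": 179,  "train_on_batch": 198},
}

def slowdowns(times):
    """Return each method's run time relative to model.fit."""
    return {name: t / times["fit"] for name, t in times.items() if name != "fit"}

for batch_size, times in timings.items():
    for name, factor in slowdowns(times).items():
        print(f"batch size {batch_size:5d}: {name:15s} takes {factor:.2f}x as long as fit")
```

With a batch size of 64 the two on-the-fly variants take roughly 4.5 to 4.9 times as long as model.fit; at 8096 the gap narrows to about 1.8 to 2.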

Is this kind of behaviour normal (when there is no GPU involved), and if not, what could or should be done to increase the computational performance of the less memory-intensive options with sensible batch sizes? model.fit_generator in particular fails to provide decent performance. There also seems to be no option to divide the data into manageable pieces and then run model.fit iteratively on constantly changing training data.
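One way to avoid materialising the full N x inputDim array, offered only as a sketch and not benchmarked against the numbers above, is a strided view over the 1D signal. It assumes NumPy 1.20 or newer for sliding_window_view, uses a small stand-in signal, and, unlike the script below, does not wrap around at the end of the data. The name make_batch is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # assumes NumPy >= 1.20

input_dim = 104
signal = np.arange(1000, dtype=np.float32)  # stand-in for the 1D sensor array

# Read-only view of shape (N - input_dim + 1, input_dim); no data is copied here.
windows = sliding_window_view(signal, input_dim)

def make_batch(start_indices):
    """Gather one mini batch; only the selected windows are copied."""
    return windows[start_indices][..., np.newaxis]

batch = make_batch(np.array([0, 5, 896]))
print(windows.shape, batch.shape)
print(np.shares_memory(windows, signal))
```

Only the rows picked for a batch are copied, so memory use stays close to that of the 1D array. Handing the whole view to model.fit would most likely materialise it, so the view mainly helps when batches are gathered on the fly.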

Please note that the provided dummy script is just what the name suggests, and the amount of data has been trimmed so that all three options are feasible. The model, however, is similar to the one I am actually using, to keep the situation realistic.

```python
from tqdm       import tqdm

import numpy as np
import tensorflow as tf

import time
import sys
import argparse

inputData    = None
outputData   = None
batchIndices = None
opts         = None

class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'

    # inputData, outputData and batchIndices are module-level globals set in main();
    # they are only read here, so no global declarations are needed.

    def __init__(self, batchSize, shuffle):
        'Initialization'
        self.batchIndices = batchIndices
        self.batchSize    = batchSize
        self.shuffle      = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int( np.floor( inputData.size / self.batchSize ) )

    def __getitem__(self, index):
        'Generate one batch of data'

        # Generate data
        X, y = self.__data_generation(self.indexes[index*self.batchSize:(index+1)*self.batchSize])

        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(inputData.size)
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, INDX):
        'Generates data containing batch_size samples'

        # Generate data
        X = np.expand_dims( inputData[ np.mod( batchIndices + np.reshape(INDX,(INDX.size,1)) , inputData.size ) ], axis=2)
        y = outputData[INDX,:]

        return X, y

def main( ):

    global inputData
    global outputData
    global batchIndices
    global opts

    # Data generation

    print(' ')
    print('Generating data...')

    np.random.seed(0) # For reproducible results

    inputDim  = int(104)                      # Input  dimension
    outputDim = int(  2)                      # Output dimension
    N         = int(1049344)                  # Total number of samples
    M         = int(5e4)                      # Number of anomalies
    trainINDX = np.arange(N, dtype=np.uint32)

    inputData = np.sin(trainINDX) + np.random.normal(loc=0.0, scale=0.20, size=N) # Source data stored in a single array

    anomalyLocations = np.random.choice(N, M, replace=False)

    inputData[anomalyLocations] += 0.5

    outputData = np.zeros((N,outputDim)) # One-hot encoded target array without ones

    for i in range(N):
        if( np.any( np.logical_and( anomalyLocations >= i, anomalyLocations < np.mod(i+inputDim,N) ) ) ):
            outputData[i,1] = 1 # set class #2 to one if there is at least a single anomaly within range [i,i+inputDim)
        else:
            outputData[i,0] = 1 # set class #1 to one if there are no anomalies within range [i,i+inputDim)

    print('...completed')
    print(' ')

    # Create a model for anomaly detection

    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(filters=24, kernel_size=9, strides=1, padding='valid', dilation_rate=1, activation='relu', use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', input_shape=(inputDim,1)),
        tf.keras.layers.MaxPooling1D(pool_size=4, strides=None, padding='valid'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(20, activation='relu', use_bias=True),
        tf.keras.layers.Dense(outputDim, activation='softmax')
    ])

    model.compile( tf.keras.optimizers.Adam(),
                   loss=tf.keras.losses.CategoricalCrossentropy(),
                   metrics=[tf.keras.metrics.CategoricalAccuracy()])

    print(' ')

    relativeIndices = np.arange(inputDim)                            # Indices belonging to a single sample relative to current position
    batchIndices    = np.tile( relativeIndices, (opts.batchSize,1) ) # Relative indices tiled into an array of size ( batchSize , inputDim )
    stepsPerEpoch   = int( np.floor( N / opts.batchSize ) )          # Steps per epoch

    # Create an instance of the DataGenerator class
    generator = DataGenerator(batchSize=opts.batchSize, shuffle=True)

    # Solve by gathering data into a large float32 array of size ( N , inputDim ) and feeding it to model.fit

    startTime = time.time()

    X = np.expand_dims( inputData[ np.mod( np.tile(relativeIndices,(N,1)) + np.reshape(trainINDX,(N,1)) , N ) ], axis=2)
    y = outputData[trainINDX, :]

    history = model.fit(x=X, y=y, sample_weight=None, batch_size=opts.batchSize, verbose=1, callbacks=None, validation_split=None, shuffle=True, epochs=opts.epochCount)

    referenceTime = time.time() - startTime
    print(' ')
    print('Total solution time with model.fit: %6.3f seconds' % referenceTime)
    print(' ')

    # Solve with model.fit_generator

    startTime = time.time()

    history = model.fit(x=generator, steps_per_epoch=stepsPerEpoch, verbose=1, callbacks=None, epochs=opts.epochCount, max_queue_size=1024, use_multiprocessing=False)

    generatorTime = time.time() - startTime
    print(' ')
    # Report the extra time relative to model.fit, so that the '% more' label is accurate
    print('Total solution time with model.fit_generator: %6.3f seconds (%6.2f %% more)' % (generatorTime, 100.0 * (generatorTime - referenceTime)/referenceTime))
    print(' ')

    # Solve by gathering data into batches of size ( batchSize , inputDim ) and feeding it to model.train_on_batch

    startTime = time.time()

    for epoch in range(opts.epochCount):

        print(' ')
        print('Training epoch # %2d ...' % (epoch+1))
        print(' ')

        np.random.shuffle(trainINDX)

        epochStartTime = time.time()

        for step in tqdm( range( stepsPerEpoch ) ):

            INDX = trainINDX[ step*opts.batchSize : (step+1)*opts.batchSize ]

            X = np.expand_dims( inputData[ np.mod( batchIndices + np.reshape(INDX,(opts.batchSize,1)) , N ) ], axis=2)
            y = outputData[INDX,:]

            history = model.train_on_batch(x=X, y=y, sample_weight=None, class_weight=None, reset_metrics=False)

        print(' ')
        print('...completed with loss = %9.6e, accuracy = %6.2f %%, %6.2f ms/step' % (history[0], 100.0*history[1], (1000*(time.time() - epochStartTime)/np.floor(trainINDX.size / opts.batchSize))))
        print(' ')

    batchTime = time.time() - startTime
    print(' ')
    # Report the extra time relative to model.fit, so that the '% more' label is accurate
    print('Total solution time with model.train_on_batch: %6.3f seconds (%6.2f %% more)' % (batchTime, 100.0 * (batchTime - referenceTime)/referenceTime))
    print(' ')

parser = argparse.ArgumentParser()

parser.add_argument('--batchSize', type=int,
                default=128,
                help='Batch size')
parser.add_argument('--epochCount', type=int,
                default=5,
                help='Epoch count')

opts, unparsed = parser.parse_known_args()

if __name__ == "__main__":
    main( )
```
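
The index arithmetic that builds each mini batch in the script is easier to follow on a toy example. The sketch below mirrors the batchIndices + reshape(INDX) gather with a 10-element signal and windows of length 4; the sizes and variable names are made up for illustration.

```python
import numpy as np

data = np.arange(10.0)   # toy 1D signal: 0.0 ... 9.0
input_dim = 4            # window length (inputDim in the script)
batch_size = 3

relative = np.arange(input_dim)                      # offsets within one window
batch_indices = np.tile(relative, (batch_size, 1))   # shape (batch_size, input_dim)

starts = np.array([0, 5, 8])                         # window start positions (INDX)

# Broadcasting adds each start position to its row of offsets;
# the modulo wraps windows that run past the end of the signal.
idx = np.mod(batch_indices + starts.reshape(-1, 1), data.size)

X = np.expand_dims(data[idx], axis=2)                # shape (batch_size, input_dim, 1)
print(idx)
print(X.shape)
```

Each row of idx lists the sample positions of one window, and the last row shows the wrap-around for a window starting near the end of the signal.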

Tuukka Nieminen · Answer

To answer the question myself: I recently updated to Python 3.7.7 and TensorFlow 2.2.0 rc2, and suddenly all my issues vanished. Now, running for 5 epochs with the default batch size of 128, model.fit with explicitly formed NumPy arrays takes 126.162 seconds, model.fit with the provided generator takes 149.053 seconds, and model.train_on_batch takes 240.698 seconds. This is with the default build of TensorFlow, without support for the AVX2 and FMA instructions that my CPU supports.
