Expected performance of training a tf.keras.Sequential model with model.fit, model.fit_generator and model.train_on_batch

Data Science Asked by Tuukka Nieminen on October 10, 2020

I am using Keras with the TensorFlow backend to train a simple 1D CNN to detect specific events from sensor data. While the data, with tens of millions of samples, easily fits in RAM in the form of a 1D float array, it obviously takes a huge amount of memory to store the data as an N x inputDim array that can be passed to model.fit for training. While I can use model.fit_generator or model.train_on_batch to generate the required mini batches on the fly, for some reason I am observing a huge performance gap between model.fit and model.fit_generator & model.train_on_batch, even though everything is stored in memory and mini batch generation is fast, as it basically only consists of reshaping the data. Therefore, I'm wondering whether I am doing something terribly wrong or if this kind of performance gap is to be expected. I am using the CPU version of TensorFlow 2.0 with a 3.2 GHz Intel Core i7 processor (4 cores with multithreading support) and Python 3.6.3 on macOS Mojave.
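To make the memory argument concrete, here is a minimal sketch of how overlapping windows can be gathered from a 1D array by index arithmetic, in the same spirit as the script further down. The toy sizes and the variable names (`signal`, `offsets`, `starts`) are illustrative assumptions, and the memory estimate assumes 8-byte floats, which is what NumPy produces by default.

```python
import numpy as np

# Toy sizes chosen for illustration only (the script in the question uses
# N = 1049344 and inputDim = 104).
N, input_dim = 10, 4

signal = np.arange(N, dtype=np.float64)   # stands in for the 1D sensor array
offsets = np.arange(input_dim)            # positions inside one window
starts = np.arange(N).reshape(N, 1)       # one window start per sample

# Broadcasting (N, 1) + (input_dim,) gives an (N, input_dim) index array;
# np.mod wraps windows that run past the end of the signal.
windows = signal[np.mod(starts + offsets, N)]

print(windows.shape)   # (10, 4)
print(windows[8])      # window starting at index 8 wraps around: [8. 9. 0. 1.]

# Size of the full window array at the question's real sizes, assuming float64.
full_bytes = 1049344 * 104 * 8
print(f"{full_bytes / 2**20:.1f} MiB")
```

The windowed array is inputDim times larger than the source signal, which is why building it for only one mini batch at a time is attractive.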

In short, I created a dummy Python script to recreate the issue, and it reveals that with a batch size of 64, it takes 407 seconds to run 10 epochs with model.fit, 1852 seconds with model.fit_generator, and 1985 seconds with model.train_on_batch. CPU loads are ~220%, ~130%, and ~120% respectively, and it seems especially odd that model.fit_generator & model.train_on_batch are practically on par, while model.fit_generator should be able to parallelise mini batch creation and model.train_on_batch definitely does not. That is, model.fit (with huge memory requirements) beats the other solution candidates, which have easily manageable memory requirements, by a factor of more than four. Obviously, CPU loads increase and total training times decrease with increasing batch size, but model.fit is always fastest, by a factor of roughly two or more, up to a batch size of 8096. In that case, model.fit takes 99 seconds to run 10 epochs with a CPU load of ~860% (or pretty much everything I have got), model.fit_generator takes 179 seconds with a CPU load of ~700%, and model.train_on_batch takes 198 seconds with a CPU load of ~680%.
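The slowdown factors implied by these timings can be checked with a few lines of arithmetic. The numbers are copied from the paragraph above; the dictionary layout and the `slowdown` helper are only an illustrative sketch.

```python
# Wall-clock times in seconds for 10 epochs, copied from the question.
timings = {
    64:   {"fit": 407.0, "fit_generator": 1852.0, "train_on_batch": 1985.0},
    8096: {"fit": 99.0,  "fit_generator": 179.0,  "train_on_batch": 198.0},
}

def slowdown(batch_size, method):
    """Return how many times slower `method` is than model.fit."""
    row = timings[batch_size]
    return row[method] / row["fit"]

for batch_size in timings:
    for method in ("fit_generator", "train_on_batch"):
        factor = slowdown(batch_size, method)
        print(f"batch {batch_size:5d}  {method:15s} {factor:.2f}x")
```

This gives roughly 4.5x and 4.9x at batch size 64, and roughly 1.8x and 2.0x at batch size 8096.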

Is this kind of behaviour normal (when there is no GPU involved), or what could/should be done in order to increase the computational performance of the less memory-intensive options with sensible batch sizes? Specifically, model.fit_generator fails to provide decent performance. It seems that there is no built-in option to divide all the data into manageable pieces and then run model.fit in an iterative manner with constantly changing training data.
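One way to approximate the "manageable pieces" idea by hand is to materialise the windows for one chunk of samples at a time. The sketch below uses NumPy only; `iter_chunks` and its parameters are hypothetical names, not part of Keras or of the script in this question.

```python
import numpy as np

def iter_chunks(signal, targets, input_dim, chunk_size):
    """Yield (X, y) pairs covering `signal` in pieces of at most `chunk_size` windows.

    Hypothetical helper for illustration. Windows wrap around the end of
    `signal`, mirroring the np.mod indexing used in the question's script.
    """
    n = signal.size
    offsets = np.arange(input_dim)
    for start in range(0, n, chunk_size):
        starts = np.arange(start, min(start + chunk_size, n)).reshape(-1, 1)
        # (chunk, input_dim) windows, plus a trailing channel axis for Conv1D.
        X = np.expand_dims(signal[np.mod(starts + offsets, n)], axis=2)
        yield X, targets[starts[:, 0]]

signal = np.arange(10, dtype=np.float64)
targets = np.eye(2)[np.arange(10) % 2]   # dummy one-hot labels
shapes = [(X.shape, y.shape) for X, y in iter_chunks(signal, targets, 4, 4)]
print(shapes)
```

Each yielded pair could then be passed to `model.fit(X, y, batch_size=..., epochs=1)` inside an outer loop over epochs, since repeated calls on the same compiled model continue from the current weights. Note that shuffling would then only happen within a chunk.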

Please do note that the provided dummy script is just what the name suggests, and the amount of data has been trimmed so that it makes all three options feasible. The used model, however, is similar to what I am actually using (to provide a realistic situation).

from tqdm       import tqdm

import numpy as np
import tensorflow as tf

import time
import sys
import argparse

inputData    = None
outputData   = None
batchIndices = None
opts         = None

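class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'

    # inputData, outputData and batchIndices are module-level globals that are
    # only read here, so no `global` declarations are needed.
    def __init__(self, batchSize, shuffle):
        self.batchIndices = batchIndices
        self.batchSize    = batchSize
        self.shuffle      = shuffle
        self.on_epoch_end() # Build self.indexes before the first batch is requested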

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int( np.floor( inputData.size / self.batchSize ) )

    def __getitem__(self, index):
        'Generate one batch of data'

        # Generate data
        X, y = self.__data_generation(self.indexes[index*self.batchSize:(index+1)*self.batchSize])

        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(inputData.size)
        if self.shuffle:
            np.random.shuffle(self.indexes) # Reshuffle sample order in place

    def __data_generation(self, INDX):
        'Generates data containing batch_size samples'

        # Generate data
        X = np.expand_dims( inputData[ np.mod( batchIndices + np.reshape(INDX,(INDX.size,1)) , inputData.size ) ], axis=2)
        y = outputData[INDX,:] 

        return X, y

def main( ):

    global inputData
    global outputData
    global batchIndices
    global opts

    # Data generation

    print(' ')
    print('Generating data...')

    np.random.seed(0) # For reproducible results

    inputDim  = int(104)                      # Input  dimension
    outputDim = int(  2)                      # Output dimension
    N         = int(1049344)                  # Total number of samples
    M         = int(5e4)                      # Number of anomalies
    trainINDX = np.arange(N, dtype=np.uint32)

    inputData = np.sin(trainINDX) + np.random.normal(loc=0.0, scale=0.20, size=N) # Source data stored in a single array

    anomalyLocations = np.random.choice(N, M, replace=False)

    inputData[anomalyLocations] += 0.5

    outputData = np.zeros((N,outputDim)) # One-hot encoded target array without ones

    for i in range(N):
        if( np.any( np.logical_and( anomalyLocations >= i, anomalyLocations < np.mod(i+inputDim,N) ) ) ):
            outputData[i,1] = 1 # set class #2 to one if there is at least a single anomaly within range [i,i+inputDim)
        else:
            outputData[i,0] = 1 # set class #1 to one if there are no anomalies within range [i,i+inputDim)

    print(' ')

    # Create a model for anomaly detection

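    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(filters=24, kernel_size=9, strides=1, padding='valid', dilation_rate=1, activation='relu', use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', input_shape=(inputDim,1)),
        tf.keras.layers.MaxPooling1D(pool_size=4, strides=None, padding='valid'),
        tf.keras.layers.Flatten(), # Assumed: collapses (steps, filters) so the output matches the ( N , outputDim ) targets
        tf.keras.layers.Dense(20, activation='relu', use_bias=True),
        tf.keras.layers.Dense(outputDim, activation='softmax')
    ])

    # The arguments after the optimizer were lost from the post. Categorical cross-entropy and an
    # accuracy metric are assumed here, matching the one-hot targets and the loss/accuracy values
    # read from model.train_on_batch below.
    model.compile( tf.keras.optimizers.Adam(),
                   loss='categorical_crossentropy',
                   metrics=['accuracy'] )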

    print(' ')

    relativeIndices = np.arange(inputDim)                            # Indices belonging to a single sample relative to current position
    batchIndices    = np.tile( relativeIndices, (opts.batchSize,1) ) # Relative indices tiled into an array of size ( batchSize , inputDim )  
    stepsPerEpoch   = int( np.floor( N / opts.batchSize ) )          # Steps per epoch

    # Create an instance of the DataGenerator class
    generator = DataGenerator(batchSize=opts.batchSize, shuffle=True)

    # Solve by gathering data into a large float array of size ( N , inputDim ) and feeding it to model.fit

    startTime = time.time()

    X = np.expand_dims( inputData[ np.mod( np.tile(relativeIndices,(N,1)) + np.reshape(trainINDX,(N,1)) , N ) ], axis=2)
    y = outputData[trainINDX, :]

    history = model.fit(x=X, y=y, sample_weight=None, batch_size=opts.batchSize, verbose=1, callbacks=None, validation_split=None, shuffle=True, epochs=opts.epochCount)

    referenceTime = time.time() - startTime
    print(' ')
    print('Total solution time with model.fit: %6.3f seconds' % referenceTime)
    print(' ')

    # Solve with model.fit_generator  

    startTime = time.time()

    history = model.fit_generator(generator, steps_per_epoch=stepsPerEpoch, verbose=1, callbacks=None, epochs=opts.epochCount, max_queue_size=1024, use_multiprocessing=False)

    generatorTime = time.time() - startTime
    print(' ')
    print('Total solution time with model.fit_generator: %6.3f seconds (%6.2f %% of the model.fit time)' % (generatorTime, 100.0 * generatorTime/referenceTime))
    print(' ')

    # Solve by gathering data into batches of size ( batchSize , inputDim ) and feeding it to model.train_on_batch

    startTime = time.time()

    for epoch in range(opts.epochCount):

        print(' ')
        print('Training epoch # %2d ...' % (epoch+1))
        print(' ')


        epochStartTime = time.time()

        for step in tqdm( range( stepsPerEpoch ) ):

            INDX = trainINDX[ step*opts.batchSize : (step+1)*opts.batchSize ]

            X = np.expand_dims( inputData[ np.mod( batchIndices + np.reshape(INDX,(opts.batchSize,1)) , N ) ], axis=2)
            y = outputData[INDX,:]

            history = model.train_on_batch(x=X, y=y, sample_weight=None, class_weight=None, reset_metrics=False)

        print(' ')
        print('...completed with loss = %9.6e, accuracy = %6.2f %%, %6.2f ms/step' % (history[0], 100.0*history[1], (1000*(time.time() - epochStartTime)/np.floor(trainINDX.size / opts.batchSize))))
        print(' ')

    batchTime = time.time() - startTime
    print(' ')
    print('Total solution time with model.train_on_batch: %6.3f seconds (%6.2f %% of the model.fit time)' % (batchTime, 100.0 * batchTime/referenceTime))
    print(' ')

parser = argparse.ArgumentParser()

parser.add_argument('--batchSize', type=int,
                help='Batch size')
parser.add_argument('--epochCount', type=int,
                help='Epoch count')

opts, unparsed = parser.parse_known_args()

if __name__ == "__main__":
    main( )

One Answer

To answer the question myself: I recently updated to Python 3.7.7 and TensorFlow 2.2.0 rc2, and suddenly all my issues vanished. Now, running for 5 epochs with the default batch size of 128, training with explicitly formed NumPy arrays takes 126.162 seconds, training with the provided generator takes 149.053 seconds, and model.train_on_batch takes 240.698 seconds. This is with the default build of TensorFlow, without support for the AVX2 & FMA instructions supported by my CPU.

Answered by Tuukka Nieminen on October 10, 2020
