I am using Keras with the TensorFlow backend to train a simple 1D CNN to detect specific events in sensor data. While the data, with tens of millions of samples, easily fits in RAM as a 1D float array, it obviously takes a huge amount of memory to store it as an N x inputDim array that can be passed to model.fit for training. I can use model.fit_generator or model.train_on_batch to generate the required mini-batches on the fly, but for some reason I am observing a huge performance gap between model.fit and model.fit_generator & model.train_on_batch, even though everything is stored in memory and mini-batch generation is fast, as it basically consists only of reshaping the data. Therefore, I'm wondering whether I am doing something terribly wrong or whether this kind of performance gap is to be expected. I am using the CPU version of TensorFlow 2.0 on a 3.2 GHz Intel Core i7 processor (4 cores with multithreading support) with Python 3.6.3 on macOS Mojave.
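To put the memory argument in numbers, here is a rough sketch using the sizes from the dummy script further down (N = 1049344, inputDim = 104). It assumes float64 storage, which is NumPy's default; the variable names are only illustrative.

```python
import numpy as np

N = 1049344       # number of samples in the dummy script
input_dim = 104   # window length (inputDim) in the dummy script

# Assumption: float64 storage, NumPy's default floating-point type.
signal = np.zeros(N, dtype=np.float64)

flat_bytes = signal.nbytes                      # the 1D source array
window_bytes = N * input_dim * signal.itemsize  # one full window per start position

print(f"1D signal:    {flat_bytes / 2**20:8.1f} MiB")
print(f"N x inputDim: {window_bytes / 2**20:8.1f} MiB")
print(f"ratio:        {window_bytes // flat_bytes}x")
```

The explicit window array is inputDim times larger than the source signal, so with tens of millions of samples it quickly reaches tens of gigabytes.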
In short, I created a dummy Python script to recreate the issue. With a batch size of 64, it takes 407 seconds to run 10 epochs with model.fit, 1852 seconds with model.fit_generator, and 1985 seconds with model.train_on_batch. CPU loads are ~220%, ~130%, and ~120%, respectively. It seems especially odd that model.fit_generator & model.train_on_batch are practically on par, since model.fit_generator should be able to parallelise mini-batch creation while model.train_on_batch definitely does not. That is, model.fit (with huge memory requirements) beats the other candidates, which have easily manageable memory requirements, by a factor of more than four. Obviously, CPU loads increase and total training times decrease as the batch size increases, but model.fit is always the fastest by a factor of roughly two or more, up to a batch size of 8096. In that case, model.fit takes 99 seconds to run 10 epochs with a CPU load of ~860% (or pretty much everything I have got), model.fit_generator takes 179 seconds with a CPU load of ~700%, and model.train_on_batch takes 198 seconds with a CPU load of ~680%.
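For reference, the slowdown factors implied by these timings can be computed with a few lines. The dictionary layout and the helper name `slowdown` are illustrative only; the numbers are the ones reported above.

```python
# Wall-clock times in seconds for 10 epochs, as reported above.
timings = {
    64:   {"fit": 407.0, "fit_generator": 1852.0, "train_on_batch": 1985.0},
    8096: {"fit": 99.0,  "fit_generator": 179.0,  "train_on_batch": 198.0},
}

def slowdown(times):
    """Return each method's run time relative to model.fit."""
    reference = times["fit"]
    return {name: t / reference for name, t in times.items()}

ratios = {batch_size: slowdown(t) for batch_size, t in timings.items()}
for batch_size, r in ratios.items():
    print(batch_size, {name: round(value, 2) for name, value in r.items()})
```

With a batch size of 64 the on-the-fly variants are about 4.6 to 4.9 times slower than model.fit; at 8096 the gap shrinks to about 1.8 to 2.0 times.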
Is this kind of behaviour normal (when there is no GPU involved), or what could/should be done to increase the computational performance of the less memory-intensive options with sensible batch sizes? In particular, model.fit_generator fails to provide decent performance. There seems to be no built-in option to divide all the data into manageable pieces and then run model.fit iteratively with constantly changing training data.
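One way to approximate such chunked training by hand is to build the window array for only a slice of start positions at a time. The following is a minimal NumPy-only sketch under that assumption; `iter_window_chunks` and `chunk_size` are hypothetical names, not part of Keras, and the wrap-around indexing mirrors the dummy script.

```python
import numpy as np

def iter_window_chunks(signal, labels, input_dim, chunk_size):
    """Yield (X, y) for consecutive blocks of window start positions."""
    n = signal.size
    offsets = np.arange(input_dim)  # indices within a single window
    for start in range(0, n, chunk_size):
        starts = np.arange(start, min(start + chunk_size, n))
        # Wrap around at the end of the signal, as the dummy script does.
        idx = np.mod(starts[:, None] + offsets[None, :], n)
        X = signal[idx][..., np.newaxis]  # shape (chunk, input_dim, 1)
        y = labels[starts]
        yield X, y

# Tiny synthetic check with 10 samples and windows of length 4.
signal = np.arange(10, dtype=np.float32)
labels = np.eye(2, dtype=np.float32)[np.arange(10) % 2]
chunks = list(iter_window_chunks(signal, labels, input_dim=4, chunk_size=3))
print(len(chunks), chunks[0][0].shape, chunks[-1][0].shape)
```

In a training loop, each chunk could be passed to `model.fit(X, y, batch_size=..., epochs=1)`; repeated calls continue from the current weights. Note that samples are then only shuffled within a chunk, so this is not equivalent to shuffling the whole data set.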
Please note that the provided dummy script is just what the name suggests, and the amount of data has been trimmed so that all three options are feasible. The model, however, is similar to the one I am actually using (to provide a realistic situation).
```python
from tqdm import tqdm
import numpy as np
import tensorflow as tf
import time
import sys
import argparse

# Module-level state shared between main() and DataGenerator
inputData = None
outputData = None
batchIndices = None
opts = None


class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'

    def __init__(self, batchSize, shuffle):
        'Initialization'
        self.batchIndices = batchIndices
        self.batchSize = batchSize
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(inputData.size / self.batchSize))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate data
        X, y = self.__data_generation(self.indexes[index*self.batchSize:(index+1)*self.batchSize])
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(inputData.size)
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, INDX):
        'Generates data containing batch_size samples'
        # Generate data
        X = np.expand_dims(inputData[np.mod(batchIndices + np.reshape(INDX, (INDX.size, 1)), inputData.size)], axis=2)
        y = outputData[INDX, :]
        return X, y


def main():
    global inputData
    global outputData
    global batchIndices
    global opts

    # Data generation
    print(' ')
    print('Generating data...')
    np.random.seed(0)  # For reproducible results
    inputDim = int(104)  # Input dimension
    outputDim = int(2)  # Output dimension
    N = int(1049344)  # Total number of samples
    M = int(5e4)  # Number of anomalies
    trainINDX = np.arange(N, dtype=np.uint32)
    inputData = np.sin(trainINDX) + np.random.normal(loc=0.0, scale=0.20, size=N)  # Source data stored in a single array
    anomalyLocations = np.random.choice(N, M, replace=False)
    inputData[anomalyLocations] += 0.5
    outputData = np.zeros((N, outputDim))  # One-hot encoded target array without ones
    for i in range(N):
        if np.any(np.logical_and(anomalyLocations >= i, anomalyLocations < np.mod(i+inputDim, N))):
            outputData[i, 1] = 1  # set class #2 to one if there is at least a single anomaly within range [i,i+inputDim)
        else:
            outputData[i, 0] = 1  # set class #1 to one if there are no anomalies within range [i,i+inputDim)
    print('...completed')
    print(' ')

    # Create a model for anomaly detection
    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(filters=24, kernel_size=9, strides=1, padding='valid', dilation_rate=1,
                               activation='relu', use_bias=True, kernel_initializer='glorot_uniform',
                               bias_initializer='zeros', input_shape=(inputDim, 1)),
        tf.keras.layers.MaxPooling1D(pool_size=4, strides=None, padding='valid'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(20, activation='relu', use_bias=True),
        tf.keras.layers.Dense(outputDim, activation='softmax')
    ])
    model.compile(tf.keras.optimizers.Adam(),
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=[tf.keras.metrics.CategoricalAccuracy()])
    print(' ')

    relativeIndices = np.arange(inputDim)  # Indices belonging to a single sample relative to current position
    batchIndices = np.tile(relativeIndices, (opts.batchSize, 1))  # Relative indices tiled into an array of size ( batchSize , inputDim )
    stepsPerEpoch = int(np.floor(N / opts.batchSize))  # Steps per epoch

    # Create an instance of the DataGenerator class
    generator = DataGenerator(batchSize=opts.batchSize, shuffle=True)

    # Solve by gathering data into a large float array of size ( N , inputDim ) and feeding it to model.fit
    startTime = time.time()
    X = np.expand_dims(inputData[np.mod(np.tile(relativeIndices, (N, 1)) + np.reshape(trainINDX, (N, 1)), N)], axis=2)
    y = outputData[trainINDX, :]
    history = model.fit(x=X, y=y, sample_weight=None, batch_size=opts.batchSize, verbose=1, callbacks=None,
                        validation_split=None, shuffle=True, epochs=opts.epochCount)
    referenceTime = time.time() - startTime
    print(' ')
    print('Total solution time with model.fit: %6.3f seconds' % referenceTime)
    print(' ')

    # Solve with model.fit_generator (here: model.fit fed with the Sequence instance)
    startTime = time.time()
    history = model.fit(x=generator, steps_per_epoch=stepsPerEpoch, verbose=1, callbacks=None,
                        epochs=opts.epochCount, max_queue_size=1024, use_multiprocessing=False)
    generatorTime = time.time() - startTime
    print(' ')
    # Report the extra time relative to model.fit, so that the "% more" label is accurate
    print('Total solution time with model.fit_generator: %6.3f seconds (%6.2f %% more)'
          % (generatorTime, 100.0 * (generatorTime/referenceTime - 1.0)))
    print(' ')

    # Solve by gathering data into batches of size ( batchSize , inputDim ) and feeding it to model.train_on_batch
    startTime = time.time()
    for epoch in range(opts.epochCount):
        print(' ')
        print('Training epoch # %2d ...' % (epoch+1))
        print(' ')
        np.random.shuffle(trainINDX)
        epochStartTime = time.time()
        for step in tqdm(range(stepsPerEpoch)):
            INDX = trainINDX[step*opts.batchSize:(step+1)*opts.batchSize]
            X = np.expand_dims(inputData[np.mod(batchIndices + np.reshape(INDX, (opts.batchSize, 1)), N)], axis=2)
            y = outputData[INDX, :]
            history = model.train_on_batch(x=X, y=y, sample_weight=None, class_weight=None, reset_metrics=False)
        print(' ')
        # train_on_batch returns [loss, accuracy] when a metric is compiled in
        print('...completed with loss = %9.6e, accuracy = %6.2f %%, %6.2f ms/step'
              % (history[0], 100.0*history[1], (1000*(time.time() - epochStartTime)/np.floor(trainINDX.size / opts.batchSize))))
        print(' ')
    batchTime = time.time() - startTime
    print(' ')
    print('Total solution time with model.train_on_batch: %6.3f seconds (%6.2f %% more)'
          % (batchTime, 100.0 * (batchTime/referenceTime - 1.0)))
    print(' ')


parser = argparse.ArgumentParser()
parser.add_argument('--batchSize', type=int, default=128, help='Batch size')
parser.add_argument('--epochCount', type=int, default=5, help='Epoch count')
opts, unparsed = parser.parse_known_args()

if __name__ == "__main__":
    main()
```
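To check that mini-batch construction itself is not the bottleneck, the indexing step can be timed in isolation, without TensorFlow. This is only a sketch: it mirrors the script's indexing but uses a reduced N (an assumption, to keep the check quick), and `make_batch` is a hypothetical helper name.

```python
import time
import numpy as np

# Assumption: N is reduced from the script's 1049344 to keep this check quick.
N, input_dim, batch_size = 100_000, 104, 64
rng = np.random.default_rng(0)
signal = np.sin(np.arange(N)) + rng.normal(0.0, 0.2, size=N)

relative = np.arange(input_dim)
batch_indices = np.tile(relative, (batch_size, 1))

def make_batch(starts):
    """Gather one mini-batch of windows, mirroring the indexing used in the script."""
    idx = np.mod(batch_indices + starts.reshape(-1, 1), N)
    return np.expand_dims(signal[idx], axis=2)

order = rng.permutation(N)
steps = N // batch_size

t0 = time.perf_counter()
for step in range(steps):
    X = make_batch(order[step * batch_size:(step + 1) * batch_size])
elapsed = time.perf_counter() - t0
print(f"{1e3 * elapsed / steps:.4f} ms per batch over {steps} batches")
```

If the per-batch time printed here is small compared with the ms/step reported during training, the gap is more likely to come from per-call overhead inside the training loop than from data preparation.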
To answer the question myself: I recently updated to Python 3.7.7 and TensorFlow 2.2.0 rc2, and suddenly all my issues vanished. Now, running for 5 epochs with the default batch size of 128, model.fit with explicitly formed NumPy arrays takes 126.162 seconds, model.fit with the provided generator takes 149.053 seconds, and model.train_on_batch takes 240.698 seconds. This is with the default build of TensorFlow, without support for the AVX2 & FMA instructions that my CPU supports.
Answered by Tuukka Nieminen on October 10, 2020