
Asking for a faster method for extracting data from images

Stack Overflow Asked by diveshb on December 13, 2021

Currently I'm writing code where I extract images' RGB values using OpenCV/PIL (I tried both).
Then I put them through a function that calculates their mean and median, as well as the mean and median of the upper and lower parts.
Currently my code processes about 1 image per second, and I need to do this for 10,000+ images that are stored in different subfolders by category.
I use numpy functions for the mean and median.

Is there a faster way I can do this?

Edit: The images are all different sizes, with dimensions varying from, say, 1×1 to 1000×100, and they are in the jpg, png and bmp formats.

As for the code, I know accessing the image shouldn't take long, but the problem lies in computing the mean and median of those arrays.
I'll add a code snippet below of what it looks like (I apologize in advance if it looks bad).
I also write all of these mean and median values to an Excel sheet at the end using xlwt, which I hope shouldn't take long either.

I use os.walk to traverse the directory, after which I use
img = os.path.join(dirName,fname)
and get the values from a function defined in another file:
values = rgbavg(img)
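
Roughly, the driver loop looks like this (simplified; the folder, sheet and module names here are just placeholders):

import os
import xlwt
from rgbtools import rgbavg   # the other file that defines rgbavg()

root = "images"               # top-level folder holding the category subfolders
wb = xlwt.Workbook()
sheet = wb.add_sheet("values")

row = 0
for dirName, subdirList, fileList in os.walk(root):
    for fname in fileList:
        img = os.path.join(dirName, fname)
        values = rgbavg(img)                   # per-channel means (and medians)
        sheet.write(row, 0, img)               # first column: the file path
        for col, val in enumerate(values, start=1):
            sheet.write(row, col, float(val))  # xlwt wants plain Python numbers
        row += 1

wb.save("results.xls")

And rgbavg itself looks like this: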

import cv2
import numpy as np

def rgbavg(image_path):
    image = cv2.imread(image_path)
    img = np.array(image)
    # one row per colour channel, flattened: shape (3, height*width)
    img = img.transpose(2,0,1).reshape(3,-1)
    x, size = img.shape
    avg = np.mean(img, axis = 1)

    for i in range(0,3):
        upper = np.array([])
        lower = np.array([])
        # split the channel's pixels into above-average and below-average values
        for ele in img[i]:
            if ele > avg[i]:
                upper = np.append(upper,ele)
            else:
                lower = np.append(lower,ele)

        # append the mean of each half (0 if the half is empty)
        if upper.size != 0:
            mean = np.mean(upper)
            avg = np.append(avg,mean)
        else:
            avg = np.append(avg,0)
        if lower.size != 0:
            mean = np.mean(lower)
            avg = np.append(avg,mean)
        else:
            avg = np.append(avg,0)

    return avg

One Answer

It is possible that your program is spending a lot of time just waiting on read/write operations, and a lot of time in Python-level loops.

(The actual answer to the slowness is under "Updated" below.)

The waiting part you can mitigate by using multiple processes. The easiest way, I think, would be a Pool. This can also speed your code up by roughly a factor of however many cores you have available.

First you would prepare your data (gather a list of all the files/file paths).

Then you would pass that to the Pool, which creates the processes and collects the results:

import multiprocessing
import time

files = ["img1.jpeg", "img2.jpeg", "img3.jpeg", "img4.jpeg"]

def process_image(path):
    # process the image and return your data
    # time.sleep is only here to show that different processes are running;
    # without it the function would finish faster than the pool starts new processes
    time.sleep(1)
    return [[0,0,0], [14,14,14], multiprocessing.current_process().name]

if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(process_image, files)
        print(results)

Updated:

After inspecting the original code, I found that the filtering is what takes a long time (in this case, separating the above-average and below-average values in the array). Numpy has a much faster way to filter:

boolean_array = img_array > value   # (or <, >=, <=) returns a boolean array

and

filtered = img_array[boolean_array]  # returns the filtered list
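
For example (a tiny sketch of the idea; the numbers are just made up):

import numpy as np

a = np.array([1, 5, 2, 8, 3])
mask = a > a.mean()   # mean is 3.8, so mask is [False, True, False, True, False]
print(a[mask])        # [5 8]   -> the "upper" values
print(a[~mask])       # [1 2 3] -> the "lower" values

Applied to the original computation: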

import multiprocessing
import numpy
import cv2


def process_image(xyz):
    # a random image stands in for reading a real file, so the benchmark is self-contained
    img = (numpy.random.randint(255, size=(1000,1000,3), dtype=numpy.uint8)
                 .transpose(2,0,1).reshape(3,-1))
    avg = numpy.mean(img, axis = 1)

    x, size = img.shape

    for i in range(0,3):
        upper = img[i][img[i] >= avg[i]]
        lower = img[i][img[i] < avg[i]]
        if upper.size != 0:
            # why you are saving these I am not sure, but keeping the original behaviour
            mean = numpy.mean(upper)
            avg = numpy.append(avg,mean)
        else:
            avg = numpy.append(avg,0)
        if lower.size != 0:
            mean = numpy.mean(lower)
            avg = numpy.append(avg,mean)
        else:
            avg = numpy.append(avg,0)
    
    return [avg, mean, multiprocessing.current_process().name]

if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(process_image, range(0,12))
        print(results)

This is with multiprocessing added
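
If you want to run this over the real files instead of random data, here is a rough sketch of the full thing (the file list is just illustrative; in practice you would build it with os.walk):

import multiprocessing
import cv2
import numpy as np

def process_image(path):
    img = cv2.imread(path)
    if img is None:                     # unreadable/corrupt file
        return [path, None]
    img = img.transpose(2, 0, 1).reshape(3, -1)
    stats = list(np.mean(img, axis=1))  # per-channel means
    for i in range(3):
        channel = img[i]
        upper = channel[channel >= stats[i]]
        lower = channel[channel < stats[i]]
        stats.append(np.mean(upper) if upper.size else 0)
        stats.append(np.mean(lower) if lower.size else 0)
    return [path, stats]

if __name__ == '__main__':
    files = ["img1.jpeg", "img2.jpeg"]   # build this list with os.walk
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(process_image, files)
        print(results)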

Answered by IamFr0ssT on December 13, 2021
