
How would you build a big production ready image training dataset from scratch?

Data Science · Asked by Basti on January 13, 2021

How would you most likely create a large production-ready image training dataset from scratch, including annotations, for an image classification task?
We will capture a large number of images (~1 million) with industrial cameras and save them in an S3 bucket. Do you think a data lake infrastructure is necessary?
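For context, our rough ingestion plan looks like the sketch below (assuming boto3; the bucket name and key layout are just placeholders). The idea is to partition keys by date and camera so that later sampling and querying stay manageable even without a full data lake:

    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "factory-images"  # placeholder bucket name

    def upload_image(local_path: str, camera_id: str) -> str:
        """Upload one image under a date/camera-partitioned key prefix."""
        now = datetime.now(timezone.utc)
        key = f"raw/date={now:%Y-%m-%d}/camera={camera_id}/{now:%H%M%S%f}.jpg"
        s3.upload_file(local_path, BUCKET, key)
        return key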

In your opinion, what are the most suitable methods for annotating the images in the shortest possible time (bounding boxes are not needed)?
Solutions that I have been able to find so far are the following:

  • Use an open-source, web-based image annotation tool like make-sense or LOST (problem: who will annotate the images? These tools don't seem well suited to large amounts of image data). See also awesome-data-labeling.
  • Build a gamified web application and let users annotate images, earning discount codes as motivation.
  • Use third-party tools with annotation workforces, such as Playment, Labelbox, or Amazon Mechanical Turk.

Are there any options I missed? In principle, it would be possible to pay for the annotation, but this should be avoided or the cost kept as low as possible.

Are there things that should be considered architecturally with such a large database?

One Answer

I'm not an expert in image classification, so I'm just going to give some general advice here.

The strategy should be progressive, for instance:

  1. Proof of concept: devise a first draft of the annotation process with some general guidelines, then take a random subset of a few hundred images and try annotating them following the guidelines. Carefully observe the problems found at the annotation stage: ambiguous cases, problems with the granularity of the classes (for instance, a class which is too general), potential error cases. Devise a training/testing setup to train a mock-up system, test it with the small dataset, fix any bugs in the process, and possibly try different methods. The goal of this stage is to iron out the different steps and rule out any options which turn out to be unworkable.
  2. Prototype: have a team of a few laypeople annotate a few thousand images following the updated annotation guidelines. Again, pay attention to any problem, especially human errors in the process. At least a subset of the images should be annotated by several annotators in order to detect differences (a simple agreement check is sketched right after this list). Ask the annotators for feedback: were there any difficulties, was it fun, would they do it as a game or not, etc. At this stage you can start building a real ML system, albeit not an optimal one yet. Analyze the performance of the system on the different classes, possibly try different methods, and estimate the minimum number of images per class needed to obtain decent performance. Many other tests could be done at this stage, and it should become clear how accurate the manual annotation is and whether annotators need to be compensated or not.
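A quick way to quantify how much two annotators agree on the overlap subset is Cohen's kappa; here is a minimal sketch with scikit-learn (the class labels are made up for illustration):

    from sklearn.metrics import cohen_kappa_score

    # Labels two annotators assigned to the same images, in the same order
    # (hypothetical class names for illustration).
    annotator_a = ["scratch", "dent", "ok", "dent", "ok", "scratch"]
    annotator_b = ["scratch", "ok",   "ok", "dent", "ok", "dent"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, ~0 = chance level

Low agreement on some classes is a strong hint that the guidelines for those classes are ambiguous and need another iteration.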

It's only around this stage that the final annotation process can be fully designed. Depending on the strategy, you could consider iterative manual annotation: some classes are going to be learned quickly by the model, so it could make sense to use the model to propose, for the next annotation round, the images on which it fails (as sketched below). Be careful to avoid bias, and keep evaluating and refining the model at every round of annotation.
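One common way to implement this "annotate where the model fails" loop is uncertainty sampling; a minimal sketch, assuming you already have predicted class probabilities for the still-unlabeled images (the function and variable names are hypothetical):

    import numpy as np

    def select_for_next_round(probs: np.ndarray, image_ids: list, k: int = 1000) -> list:
        """Return the k image ids the current model is least confident about.

        probs: array of shape (n_images, n_classes) with predicted class
        probabilities, aligned with image_ids.
        """
        confidence = probs.max(axis=1)          # probability of the predicted class
        least_confident = np.argsort(confidence)[:k]
        return [image_ids[i] for i in least_confident]

Mixing a fraction of randomly sampled images into each round is a simple guard against the selection bias mentioned above.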

Answered by Erwan on January 13, 2021
