
How can you include information not present in an image for neural networks?

Data Science Asked on November 7, 2020

I am training a CNN to identify objects in images (one label per image). However, I have additional information about these images that cannot be retrieved by looking at the image itself. In more detail, I’m talking about the physical location of this object. This information proved to be important when classifying these objects.

However, I can’t think of a good way to include this information in an image recognition model, since the CNN classifies the object based on pixel values and not on ordered feature data.

One possible solution I was thinking of is to have an additional simple ML model on tabular data (mainly the location data), such as an SVM, give a certain additional weight to the output of the CNN. Would this be a good strategy? I can’t seem to find anything in the literature about this.
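A minimal sketch of that late-fusion idea, assuming a trained Keras CNN called cnn with softmax output; X_img, X_loc and y are placeholder arrays, and the mixing weight is illustrative:

    import numpy as np
    from sklearn.svm import SVC

    # Fit an SVM on the tabular location data alone
    # (probability=True enables predict_proba).
    svm = SVC(probability=True).fit(X_loc, y)

    # Late fusion: mix the class probabilities of the two models.
    # Assumes the CNN's output columns follow the same class order
    # as svm.classes_.
    p_cnn = cnn.predict(X_img)          # shape (n_samples, n_classes)
    p_svm = svm.predict_proba(X_loc)    # shape (n_samples, n_classes)
    alpha = 0.7                         # mixing weight, tuned on validation data
    p_final = alpha * p_cnn + (1 - alpha) * p_svm
    y_pred = np.argmax(p_final, axis=1)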

Thanks in advance!

edit: Someone asked what I meant by ‘location’. By ‘location’ I meant the physical location where the image was taken, in the context of a large 2D space. I don’t want to go too deep into the domain, but it’s basically an (x, y) vector on a surface area, and obviously this meta-data cannot be extracted by looking at the pixel values.

edit2: I want to propose an additional way I found that was useful, but was not mentioned in any answer.
Instead of using the neural network to predict the classes, I use the neural network to produce features.

I removed the final layer, which resulted in an output of shape 1024×1. This obviously depends on the design of your network. Then I can use these features together with the meta-data (in my case, location data) in an additional model, such as an SVM or another NN, to make predictions.
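A minimal sketch of this feature-extraction route with Keras and scikit-learn (cnn, X_img, X_loc and y are placeholders; which layer yields the 1024-dim output depends on your architecture):

    import numpy as np
    from tensorflow import keras
    from sklearn.svm import SVC

    # Drop the final classification layer to obtain a feature extractor.
    feature_extractor = keras.Model(inputs=cnn.input,
                                    outputs=cnn.layers[-2].output)

    features = feature_extractor.predict(X_img)   # e.g. shape (n_samples, 1024)
    X_combined = np.hstack([features, X_loc])     # append the (x, y) metadata

    # Second-stage model on CNN features plus location; consider
    # standardising the columns first, since SVMs are scale-sensitive.
    clf = SVC().fit(X_combined, y)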

5 Answers

Other answers suggest adding an additional channel; I disagree. I think it's a computationally intensive, time-consuming process. Moreover, it forces non-pixel data to be processed by Conv filters, which doesn't make much sense IMHO.

I suggest you establish a multi-input model. It would be composed of three parts:

  • A Convolutional part, to process pixel data,
  • A Feed-forward part to process non-image data,
  • Another Feed-forward part that elaborates the prediction based on the concatenation of the two outputs above.

You will need to instantiate them separately, then combine them in a Keras Model(). You will also need a Concatenate() layer to combine the two different sources of data.
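A minimal sketch with the Keras functional API (all layer sizes and input shapes below are illustrative, not prescriptive):

    from tensorflow import keras
    from tensorflow.keras import layers

    n_classes = 10                                        # placeholder

    # Convolutional part: processes the pixel data.
    img_in = keras.Input(shape=(128, 128, 3), name="image")
    x = layers.Conv2D(32, 3, activation="relu")(img_in)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)

    # Feed-forward part: processes the non-image data, here an (x, y) location.
    loc_in = keras.Input(shape=(2,), name="location")
    z = layers.Dense(16, activation="relu")(loc_in)

    # Concatenate both outputs and elaborate the prediction.
    merged = layers.Concatenate()([x, z])
    merged = layers.Dense(64, activation="relu")(merged)
    out = layers.Dense(n_classes, activation="softmax")(merged)

    model = keras.Model(inputs=[img_in, loc_in], outputs=out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # model.fit([X_img, X_loc], y, ...)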

You can read more about the implementation of multi-input Neural Networks here.

Correct answer by Leevo on November 7, 2020

The simplest thing to try out is to put the information in an extra channel of the image.

So if you have RGB channels, you could add a fourth channel which would simply be the location information you have, repeated for every pixel.

This creates a lot of redundancy, of course, but means you can take any standard image classifier and it will still work.
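For example, with NumPy (X_img and X_loc are placeholders; each location value becomes its own constant channel, so an (x, y) pair adds two channels):

    import numpy as np

    # X_img: (n_samples, H, W, 3); X_loc: (n_samples, 2) holding (x, y).
    n, h, w, _ = X_img.shape
    # Tile each location value over the full spatial grid.
    extra = np.broadcast_to(X_loc[:, None, None, :], (n, h, w, 2))
    X_aug = np.concatenate([X_img, extra], axis=-1)   # now (n_samples, H, W, 5)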

Answered by n1k31t4 on November 7, 2020

edit: after the edit in the question, point 1) is less relevant now, but 2) still applies.

1) It depends a bit on the form of the location data. If you have a segmentation mask (i.e. another image with two colors denoting, for each pixel, whether it belongs to the object or not), then going with another channel as n1k31t4 suggested might be a good idea.

2) If you have the coordinates or something in a vector form, Figure 2 in this paper shows a way to put the information together. Essentially, the authors concatenate the additional info (in your case the location data) to the output of the feature extractor and feed that into the classifier of the CNN.

Answered by matthiaw91 on November 7, 2020

In the specific case of knowing the location of the object in the image, one technique would be to crop and pad each training example so that the object is in the exact center. This way the extra information is passed to the neural network implicitly. This is how most face identification neural networks work.
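A sketch of such centring, assuming the object's pixel coordinates (x, y) are known per image (the function name is mine):

    import numpy as np

    def center_on(img, x, y, size):
        # Crop a size-by-size window centred on (x, y), zero-padding
        # at the borders so the object always lands in the middle.
        half = size // 2
        padded = np.pad(img, ((half, half), (half, half), (0, 0)))
        return padded[y:y + size, x:x + size]   # indices shift by the padding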

If the "location" of the object is more abstract, like "bedroom" or "Spain," then I'd recommend concatenating the information to each pixel of the image. Don't be afraid to add a large number of extra input channels, neural networks handle this well. For example, Alpha Go has a 48 channel input layer.

Answered by QuadmasterXLII on November 7, 2020

I would definitely try one more approach in this case, which is explained below.

  1. Use a simple CNN architecture, followed by fully connected layers.
  2. Say we now have a fully connected layer (FL) of size 100.
  3. Using this FL, apply another linear regression model (followed by an activation layer). The structure of the linear regression would be:

    y = w1i·FL(Ni) + w2i·f1 + w3i·f2 + ... and so on,

    where Ni is the i-th neuron,

    f1, f2, ... are the non-image features, and

    w1i, w2i, w3i, ... are the weights.

What I'm trying to achieve is: using the output of each FL neuron, I create a linear regression model whose other features are your non-image data.

My assumption here is that during training, this will boost the weights of the neurons based on the non-image data as well.
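In Keras terms this amounts to a single Dense layer applied to the concatenation of the FL activations and the extra features, so the w's above are learned jointly (a sketch; fl and feats are placeholder tensors, n_classes a placeholder size):

    from tensorflow.keras import layers

    # fl: FL output tensor, shape (batch, 100); feats: the non-image
    # features f1, f2, ... as a tensor of shape (batch, n_feats).
    merged = layers.Concatenate()([fl, feats])
    y = layers.Dense(n_classes, activation="softmax")(merged)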

Answered by vipin bansal on November 7, 2020
