TransWikia.com

Difference between text-based image retrieval and natural language object retrieval

Data Science Asked on June 27, 2021

I’m working on a model that locates an object in a scene (2D image or 3D scene) using a natural language query. I came across this paper on natural language object retrieval, which says that the task differs from text-based image retrieval in that natural language object retrieval requires an understanding of the objects in the image, their spatial configurations, etc. I am not able to see the difference between these two tasks. Could you please explain it with an example?

One Answer

Disclaimer: I can only answer for the NLP part, since I'm no expert in image processing.

I assume that text-based image retrieval is the task of finding the image (or the part of an image) that corresponds to a short text which exclusively describes the object. In practice, this means that every content word (i.e. excluding grammatical words like determiners) in the text refers directly to the object: "a bike", "a black cat", "the red car", etc. For an ML process, this means that there's nothing to analyze in the text: every word can be associated directly with a characteristic of the image.
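To make this concrete, here is a minimal sketch of that kind of retrieval, where every content word is matched directly against image characteristics. The tag sets, file names and the `retrieve_image` helper are all hypothetical, made up for illustration; in a real system the tags would come from some image model.

```python
# Hypothetical image tags; a real system would get these from an image tagger.
IMAGE_TAGS = {
    "img_001.jpg": {"black", "cat", "sofa"},
    "img_002.jpg": {"red", "car", "street"},
    "img_003.jpg": {"bike", "street"},
}

# Grammatical words carry no information about the object, so they are dropped.
STOP_WORDS = {"a", "an", "the"}


def content_words(query):
    """Return the set of content words in the query (word order is ignored)."""
    return {word for word in query.lower().split() if word not in STOP_WORDS}


def retrieve_image(query, image_tags=IMAGE_TAGS):
    """Return the image whose tags overlap most with the query's content words."""
    words = content_words(query)
    return max(image_tags, key=lambda name: len(words & image_tags[name]))


best_for_car = retrieve_image("the red car")
best_for_cat = retrieve_image("a black cat")
```

Note that the query is treated as an unordered set of words: no parsing, no relations between words, just a direct word-to-characteristic match.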

By contrast, natural language object retrieval involves analyzing the text. For instance, "the cat on the left of the picture" is different from "the picture on the left of the cat", even though the words are the same. There can also be different ways to refer to the same object: "the book at the left of the shelf" may be the same as "the leftmost book" or "the book next to the green book". There are usually many ways to express the same meaning with language, and that makes the task much more complex. Additionally, I would assume that mapping positional descriptions to image characteristics can be tricky: "the man behind the tree" or "the second bridge" in a 2D image requires the model to "understand" depth. In a picture with two dogs, "the small dog" requires the model to "understand" the size relation between objects. Humans intuitively know how to interpret these sentences, but for a machine, natural language understanding hasn't been solved yet (it might never be).
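A small sketch of both points, under made-up assumptions: the first part shows that the two "cat"/"picture" phrases are indistinguishable when word order is ignored, and the second part shows that phrases like "the leftmost book" or "the small dog" only make sense once objects are compared with each other. The `DETECTIONS` list, its bounding boxes (`x_min, y_min, x_max, y_max` in pixels) and the helper functions are hypothetical, not the output of any real detector.

```python
from collections import Counter


def bag_of_words(text):
    """Count the words in a text, discarding their order."""
    return Counter(text.lower().split())


phrase_a = "the cat on the left of the picture"
phrase_b = "the picture on the left of the cat"

# Same words, so a word-matching approach cannot tell the phrases apart...
same_bag = bag_of_words(phrase_a) == bag_of_words(phrase_b)
# ...although the word order, and therefore the meaning, differs.
same_order = phrase_a.split() == phrase_b.split()

# Hypothetical detections: label, an attribute, and a box (x_min, y_min, x_max, y_max).
DETECTIONS = [
    {"label": "book", "color": "green", "box": (120, 40, 150, 200)},
    {"label": "book", "color": "blue", "box": (80, 40, 110, 200)},
    {"label": "dog", "color": "brown", "box": (300, 100, 500, 300)},
    {"label": "dog", "color": "white", "box": (520, 220, 580, 300)},
]


def box_area(box):
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) * (y_max - y_min)


def leftmost(detections, label):
    """'the leftmost <label>': compare horizontal positions across objects."""
    candidates = [d for d in detections if d["label"] == label]
    return min(candidates, key=lambda d: d["box"][0])


def smallest(detections, label):
    """'the small <label>': compare sizes across objects of the same kind."""
    candidates = [d for d in detections if d["label"] == label]
    return min(candidates, key=lambda d: box_area(d["box"]))


leftmost_book = leftmost(DETECTIONS, "book")
small_dog = smallest(DETECTIONS, "dog")
```

Even this toy version hard-codes one function per expression. It does not handle paraphrases such as "the book at the left of the shelf", and 2D boxes say nothing about depth ("the man behind the tree"), which is exactly where the difficulty lies.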

Answered by Erwan on June 27, 2021
