Thursday, December 15, 2011

Human vs Machine Learning of the Visual World

I strive to eventually mimic the human ability to understand a scene.
One can ask: how far are we from achieving this goal? To answer, we must first specify what it means "to understand a scene". This is perhaps ill-defined; however, here's a shot:
A dense labeling of every scene pixel, on several levels: object type (at an intermediate level, as most humans use); material - metal, wood, asphalt, glass, plastic...; and segmentation boundaries, with a good assessment of object ordering.

What I do not strive to do is achieve high-precision 3D scene reconstructions or other numeric results. I believe that we (humans) grasp scenes in a qualitative rather than quantitative manner. Using this qualitative understanding, we can, e.g., count objects by recognizing them, or estimate the distance to a scene point based on some partial ordering of the objects.

In addition, I do not wish to use sensors with abilities beyond human vision, such as time-of-flight sensors or hyper-spectral imagery. Such multimodal inputs would surely make the task of scene interpretation much simpler. Still, with some rational justifications (and some less so), I insist on using sensors which approximately mimic those of humans:
1. The multimodal case will drive us further from a true understanding of the principles of human-like vision.
2. Multimodal sensors will probably be less ubiquitous and more expensive. I wish to make vision widely available without extraordinary hardware prerequisites. Advances in computing power will probably continue to outpace cheap multimodal sensors for the foreseeable future.
3. I see the goal, as I pose it, as a greater challenge and one more worthwhile solving.

But there is a considerable amount of work for each of these tasks. Are we going in the right direction? Most tasks are handled independently, such as identification of occlusion boundaries. Granted, segmentation and recognition have been used in the same framework, but the ability to segment a scene with good precision cannot depend on the ability to identify each and every scene object... Straying a bit: which comes first, segmentation or recognition? Do they always come in the same order?

We can add more classifiers for specific object types, more computing power, and larger datasets. Will this create a vision system as flexible as a human being's? Probably not; the human visual system, besides being good at classifying previously known objects, has an excellent way of quickly learning new categories from very few examples and, from that moment on, identifying them with ease. This line of thought leads to the conclusion that what we need is stronger machine learning techniques. Let us examine the learning process in today's methods and compare it to that of a human infant.
The reason for this comparison is that observing an infant develop a good understanding of the visual world can give us bounds on the amount of time, supervision, examples, etc. needed to develop such a phenomenal system. So what are the differences between the human learning process and that of a machine (today)?
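As a toy illustration of that "learn a new category from a few examples" ability (everything here is made up for illustration - synthetic feature vectors and a nearest-class-mean rule, not any particular published method):

```python
import numpy as np

# Toy sketch: a nearest-class-mean classifier over fixed feature vectors.
# A new class is "learned" simply by averaging a handful of examples.
# All classes, centers, and features below are synthetic assumptions.

rng = np.random.default_rng(0)

def make_class(center, n, spread=0.5):
    """Synthetic examples of one 'object class' around a feature-space center."""
    return center + spread * rng.standard_normal((n, len(center)))

# Two known classes, each learned from a handful of examples.
prototypes = {
    "cup":   make_class(np.array([1.0, 0.0, 0.0]), n=5).mean(axis=0),
    "chair": make_class(np.array([0.0, 1.0, 0.0]), n=5).mean(axis=0),
}

# A brand-new class, learned from just 3 examples.
prototypes["lamp"] = make_class(np.array([0.0, 0.0, 1.0]), n=3).mean(axis=0)

def classify(x):
    """Assign x to the class whose prototype is nearest (Euclidean distance)."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# A fresh 'lamp'-like instance should very likely be recognized
# even though only 3 training examples were seen.
query = make_class(np.array([0.0, 0.0, 1.0]), n=1)[0]
print(classify(query))
```

The point of the sketch is only the shape of the problem: given decent features, adding a category is cheap; today's pipelines instead tend to retrain a classifier per class on hundreds of examples.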

A human infant is bombarded with visual data from day 0. Note that at first, he/she probably doesn't understand the input data. In addition to this data, the child receives supervision - many objects are annotated. This, however, is done in a very loose manner. Many objects are never annotated, just mentioned without being pointed at. Moreover, naming is done in a language the child does not yet understand, so language is learned in tandem with the things it represents. The annotation is also far less specific and accurate than what is presented to most machine learning methods today: it is neither a segmentation nor a bounding box. Sometimes an object is mentioned, sometimes pointed to, and sometimes it is handed (when possible) to the child; then it is tangible, allowing for another modality - touch. As far as I know, this modality has been totally ignored to date in the vision community, although 3D data hasn't. Giving the child an object also lets him/her manipulate it at will and examine it (visually, haptically, audibly if relevant). But this is obviously not necessary for all objects, since otherwise one would need to feel an object before being able to declare having seen an instance of one.
On the other hand, the amount of visual data incident on a child's retina is far larger than that available in today's learning sets. PASCAL offers on the order of 10,000 training images for some 20 object classes, meaning there are but a few hundred training examples per object class. We expect the machine learning process to see tens or hundreds of examples of an object class and achieve good recognition performance. The objects of a single class appear under different viewing conditions, not to mention the large intra-class variability. I find it hard to believe that, given such an input and without using prior knowledge of the world, an untouched brain would do much better. It may be that given the current inputs, the developed machine learning techniques do quite well - but we demand that they learn from a very sparse learning set.

It is my observation that humans also receive complex and varied inputs, but: a) they are not expected to perform well on all inputs from day 1; b) some inputs are repeatedly shown to them, many more times than those in, say, PASCAL, and from various viewpoints.

The bad news is that we should build richer learning sets, containing more data for each object. We should also think about the ORDER in which the examples in the training set are handed to the learner, so that we avoid biasing or over-fitting. The good news is that we can put less effort into annotating each object at the pixel level; my intuition is that bounding boxes probably suffice.
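One concrete way to think about ordering is a curriculum: easier examples first, harder ones later, with shuffling inside each stage so the learner sees no fixed sequence. A minimal sketch, assuming each example comes with a made-up difficulty score:

```python
import random

# Sketch of curriculum ordering: feed examples in stages of increasing
# difficulty, shuffled within each stage to avoid order bias.
# The example ids and difficulty scores are hypothetical.

examples = [
    {"id": "cup_frontal",  "difficulty": 0.1},
    {"id": "cup_occluded", "difficulty": 0.8},
    {"id": "dog_centered", "difficulty": 0.2},
    {"id": "dog_in_crowd", "difficulty": 0.9},
    {"id": "car_daylight", "difficulty": 0.3},
    {"id": "car_at_night", "difficulty": 0.7},
]

def curriculum(examples, n_stages=2, seed=0):
    """Yield examples stage by stage, easy to hard, shuffled within stages."""
    rng = random.Random(seed)
    ordered = sorted(examples, key=lambda e: e["difficulty"])
    stage_size = (len(ordered) + n_stages - 1) // n_stages
    for i in range(0, len(ordered), stage_size):
        stage = ordered[i:i + stage_size]
        rng.shuffle(stage)       # no fixed order inside a stage
        yield from stage

schedule = list(curriculum(examples))
print([e["id"] for e in schedule])
```

How to score "difficulty" (occlusion, clutter, viewpoint) is exactly the open question the paragraph raises; the sketch only fixes the mechanics.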

How much training is required to achieve the ability of the human visual system? Again, I look at human beings to give an upper bound. Assume a human of age 5 already has a mature ability to understand the world visually. Say the visual system trains and receives new data about half of the time (the rest of the time is dedicated to, e.g., sleeping). Multiply this by the maximal amount of data that passes through the optic nerve. Given the right hardware (e.g., a human brain), this is more than enough. How much information is stored in the brain? How is it represented? What is kept and what is thrown away?
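The back-of-envelope version of that upper bound, with every number an assumption (an often-quoted estimate puts the retina's output at very roughly 10 megabits per second per eye):

```python
# Rough upper bound on raw visual input by age 5. All constants are
# assumptions for illustration; ~10 Mbit/s per eye is a ballpark figure
# sometimes quoted for retinal output, not an exact measurement.

OPTIC_NERVE_BITS_PER_SEC = 10e6      # assumed, per eye
EYES = 2
WAKING_FRACTION = 0.5                # "trains about half of the time"
YEARS = 5
SECONDS_PER_YEAR = 365 * 24 * 3600

total_bits = (OPTIC_NERVE_BITS_PER_SEC * EYES * WAKING_FRACTION
              * YEARS * SECONDS_PER_YEAR)
total_terabytes = total_bits / 8 / 1e12

print(f"~{total_terabytes:.0f} TB of raw visual input by age {YEARS}")
```

Even with these crude numbers the result lands around a couple of hundred terabytes - orders of magnitude beyond any training set of 2011, which is the point of the comparison.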

5 comments:

  1. You do a lot of a questioning. I'll answer one: segmentation comes first. ;-)

    ReplyDelete
  2. Agreed, as long as it's made up of multiple ranked proposals :-)
    But seriously, I think the answer is somewhere in the middle - some of it is a byproduct of recognition and some is a preliminary stage (e.g., saliency)

    ReplyDelete
  3. Recognition helps finish the job, but segmentation still comes first. Otherwise, why can't I find the face in the coffee beans? http://joe-ks.com/archives_feb2004/NutMan.jpg

    ReplyDelete
  4. It's a good question. I think that there is a fine line between saliency and segmentation. From a biological point of view, it's probably pointless to accurately segment every object before recognizing it - too much waste of computational resources. And of course, some things will remain ambiguous until you've recognized them, or require the recognition of other parts of the image. Generic segmentation might work for many cases, but it is still an ill-posed problem as long as you're not application specific.

    ReplyDelete
  5. Agreed. But how do you describe the case where "recognition comes first"? Is it something like scene recognition used to get prior information about the objects in the scene (as in your paper)?

    ReplyDelete