Monday, June 20, 2016

Why do we see?

Why - because we have eyes, and they're connected to the brain, of course.
Is it that simple? Not really. Our eyes are connected (not directly, though) to the visual cortex. From that point on, the raw input is transformed into increasingly complex representations of scenes, objects, faces, etc.

But we do not merely register the locations of all of these various object types. It's not as if we have some vague notion of the presence of faces, objects, hands, and everything else in the scene. We actually see an image, which means that we have access to the "raw data" of our sensors.
Of course, this is an extremely simplistic view of things, since the region of an image perceived with high acuity is deceptively small, and our brain compensates to make this less apparent. Everything around the very center of the field of view becomes increasingly blurry, allowing only motion and rough color and shape to be captured by our attention (and faces and hands, but that's a special issue). There are many other details I won't go into here - so forgive me for using the terms "raw data" and "raw image" very loosely; for now, assume it's just a given image.

How can this be true? This must mean that there is also some bypassing mechanism, allowing the "raw" data to be transmitted directly to an area in the brain where it can be examined by our conscious mind.

It is interesting to try to explain, in evolutionary terms, the existence of this feature, which seems non-trivial to implement. Why did vision evolve in this way? What are the advantages of such a dual representation? How does vision work for animals for which, I assume, there is no clear-cut notion of consciousness, e.g., ants? Perhaps such life-forms have access to the visual system's conclusions but not to the "raw image"...

Arguably, in humans (and other primates), there is an interplay between our perception of high-level concepts such as objects and scenes and our direct perception of the image. But it is almost certain that it will be a long time before we understand how this happens, how conflicts between the two are resolved, etc. As always, we can probably learn more from pathological cases, where the image is available to the conscious mind without understanding, or, more curiously, where the understanding is available without access to the image.

This observation about the dual access to sensory inputs and their interpretation holds for most if not all senses, e.g., hearing, touch, taste, smell. It is also a crucial factor in the development of pattern recognition. Imagine a world where we only had access to our brain's interpretation of the sensory data and not to the data itself. Then everything would be taken for granted, e.g., we would naturally think that there is nothing more to the world than the things we perceive.
In fact, this is true for any possible setting: it is natural for any being to assume nothing else exists except what its own experience teaches it.

If anyone has physiological / evolutionary explanations for this - I would love to hear (see) them.






Sunday, March 18, 2012

Computer Vision Syndrome

Years of attempts to make machines see can have their effects on a person. It happens to me, rather often, that I walk down a street and try analyzing the current scene as if I were an instance of an algorithm I wrote, noting where it would succeed and where it would fail. One side effect of this behavior is that, as time progresses, I notice more and more objects that I fail to recognize myself at first sight. Yes, contrary to the nice perception of the human visual system as a perfect one, we also tend to misclassify objects (or miss them altogether) every once in a while.
So in a sense, instead of perfecting machines to see as I do, I find bugs in my own internal recognition system.

It is interesting that I am able to notice those bugs, as a normal system doesn't usually have a corrective behavior - it is either right or wrong. Sometimes it even gives a confidence ("I am 90% sure this is an image of a cat, but of course there's a 10% chance it's a butterfly"). What the system doesn't do is take another look from a different angle (if available), or give it more thought (pass the problem to a higher level of analysis?).
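Such a "second look" policy is easy to sketch, at least. Here is a minimal, entirely hypothetical example (every function name and number below is invented for illustration, not taken from any real system): a recognizer that accepts a confident first answer, and otherwise pools evidence across the available views instead of committing to a single uncertain guess.

```python
# Hypothetical sketch of a recognizer with a "second look" behavior.
# A "view" is just a dict of label -> score; a real system would compute
# these scores by running a trained classifier on an image.

def recognize(view):
    """Stand-in for a real classifier: returns (label, confidence)."""
    return max(view.items(), key=lambda kv: kv[1])

def recognize_with_second_look(views, threshold=0.8):
    """Accept the first view whose top label clears the threshold;
    otherwise "give it more thought" by averaging scores over all views."""
    for view in views:
        label, conf = recognize(view)
        if conf >= threshold:
            return label, conf
    pooled = {}
    for view in views:
        for label, score in view.items():
            pooled[label] = pooled.get(label, 0.0) + score / len(views)
    return max(pooled.items(), key=lambda kv: kv[1])

# Two uncertain views that, combined, clearly favor "cat".
views = [{"cat": 0.55, "butterfly": 0.45},
         {"cat": 0.70, "butterfly": 0.30}]
print(recognize_with_second_look(views))
```

Neither view alone clears the threshold, so the sketch falls back to pooling - a crude stand-in for whatever the brain actually does when it takes a second look.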

When I first read the term "computer vision syndrome" on the web, I thought it referred to this strange behavior I had adopted of (somewhat obsessively) introspecting my visual system as I live my day-to-day life. I was disappointed to find out that it is a condition caused by staring at a computer screen for long, uninterrupted periods of time. Not that it is much of a comfort, but I probably have that as well.
Still, I'd like to abuse notation and coin "computer vision syndrome" as a term describing scientists (or mere enthusiasts) in the field of computer vision dedicating their mind to the process of seeing - even when, sometimes, they shouldn't.


Monday, March 5, 2012

Doing the Dishes


"One of the marks of a true science enthusiast", my thesis advisor once said, "is what you think about in your spare time". For example, in the shower, while driving, or before going to sleep. I like thinking about vision, even (and especially) during mundane activities, such as doing the dishes. I enjoy asking - can a system endowed with state-of-the-art abilities deal with this specific task?

Indeed, the challenge inherent in computer vision can be expressed by pitting today's abilities against seemingly simple activities, such as doing the dishes:
Robots today are hardly able to fold towels. Not that this isn't impressive, but the above task is probably harder: it involves dealing with a good deal of specularity, occlusion and clutter before the chaotic pile of dishes is perceived correctly. A good control system that would actually enable the robot to grasp (physically) a fork, separating it from the rest, is needed too.
Another challenge is doing dishes you have never seen before (hopefully this won't happen too many times) - this would require an ability to generalize, or smart modeling of what (for example) a fork actually is.
For those of us lazy enough to hope that one day soon there will be a robot doing our dishes (even putting them in the dishwasher is no small challenge) - I advise some patience.



Thursday, December 15, 2011

Human vs Machine Learning of the Visual World

I strive to eventually mimic the human ability to understand a scene.
One can ask how far we are from achieving this goal. To answer, we must first specify what it means "to understand a scene". This is perhaps ill-defined; however, here's a shot:
A dense labeling of each scene pixel, on several levels: object type (at an intermediate level, as most humans use); material (metal, wood, asphalt, glass, plastic...); and segmentation boundaries, with a good assessment of object ordering.
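To make this concrete, the labeling described above could be sketched as a simple per-pixel record. This is purely illustrative - every field name and example value here is invented, not a proposal for a real annotation format:

```python
# Hypothetical per-pixel label for the "dense, multi-level" scene
# understanding described above.
from dataclasses import dataclass

@dataclass
class PixelLabel:
    object_type: str   # intermediate-level category, e.g. "car", "tree"
    material: str      # "metal", "wood", "asphalt", "glass", "plastic", ...
    segment_id: int    # which segment the pixel belongs to (boundaries
                       # fall where neighboring pixels' ids differ)
    depth_order: int   # partial ordering: smaller = closer to the camera

# A toy 2x2 "scene": a metal car occluding an asphalt road.
scene = [
    [PixelLabel("car", "metal", 0, 0),  PixelLabel("car", "metal", 0, 0)],
    [PixelLabel("road", "asphalt", 1, 1), PixelLabel("road", "asphalt", 1, 1)],
]
print(scene[0][0].object_type)  # "car"
```

Note there is no metric depth here, only an ordering - which matches the qualitative flavor of understanding argued for below.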

What I do not strive to do is achieve high-precision 3D reconstructions of scenes or some other numeric results. I believe that we (humans) grasp scenes in a qualitative rather than quantitative manner. Using this qualitative understanding, we can, e.g., count objects by recognizing them, or estimate the distance to a scene point based on some partial ordering of the objects.

In addition, I do not wish to use sensors with abilities beyond those of human vision, such as time-of-flight sensors or hyper-spectral imagery. Such multimodal inputs would surely make the task of scene interpretation much simpler. Still, with some rational justifications (and some less so), I insist on using sensors that approximately mimic those of humans:
1. The multimodal case will drive us further from a true understanding of the principles of human-like vision.
2. Multimodal sensors will probably be less ubiquitous and more expensive. I wish to make vision widely available without extraordinary hardware prerequisites. Advances in computing power will probably continue to outpace those in cheap multimodal sensors for the foreseeable future.
3. I see the goal as I pose it as a greater challenge, and one that is more worthwhile solving.

But there is a considerable amount of work for each of these tasks. Are we going in the right direction? Most tasks are handled independently, such as identification of occlusion boundaries. Granted, segmentation and recognition have been used in the same framework, but the ability to segment a scene with good precision cannot depend on the ability to identify each and every scene object... Straying a bit: which comes first, segmentation or recognition? Do they always come in the same order?

We can add more classifiers for specific object types, more computing power and larger datasets. Will this create a vision system as flexible as a human being's? Probably not; the human visual system, besides having a good ability to classify previously known objects, has an excellent way of quickly learning new categories from very few examples and, from that moment on, identifying them with ease. This line of thought leads to the conclusion that what we need is stronger machine learning techniques. Let us examine the process of learning in today's methods and compare it to that of a human infant.
The reason for this comparison is that observing an infant develop a good understanding of the visual world can give us bounds on the amount of time, supervision, examples, etc. needed to develop such a phenomenal system. So what are the differences between the human learning process and that of a machine (today)?

A human infant is bombarded with visual data from day 0. Note that at first, he/she probably doesn't understand the input data. In addition to this data, the child receives supervision - many objects are annotated. This, however, is done in a very loose manner. Many objects are never annotated, just mentioned without being pointed at. Moreover, names are mentioned in a language the child does not yet understand, so language is learned in tandem with the things it represents. The annotation is also far less specific and accurate than what is presented to most machine learning methods today: it is neither a segmentation nor a bounding box. Sometimes an object is mentioned, sometimes pointed to, sometimes it is handed (when possible) to the child, and then it is tangible, allowing for another modality - touch. This modality is, as far as I know, totally ignored to date in the vision community, although 3D data isn't. Giving the child an object also allows him/her to manipulate it at will and examine it (visually, haptically, audibly if relevant). But this is obviously not necessary for all objects, since otherwise one would need to feel an object before being able to declare having seen an instance of one.
On the other hand, the amount of visual data incident on a child's retina is far larger than that available in today's learning sets. PASCAL offers on the order of 10,000 training images for some 20 object classes, meaning there are but a few hundred training examples per object class. We expect the machine learning process to see tens or hundreds of examples of the same object class and achieve good recognition performance. The objects of a single class appear under different viewing conditions, not to mention the large intra-class variability. I find it hard to believe that, given such an input and without using prior knowledge of the world, an untouched brain would do much better. Perhaps, given the current inputs, today's machine learning techniques do quite well - we simply demand that they learn from a very sparse learning set.
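The arithmetic behind "a few hundred per class" is trivial but worth making explicit (the dataset figures are order-of-magnitude, not exact counts from any specific PASCAL release):

```python
# Back-of-envelope arithmetic for the claim above.
pascal_images = 10_000          # rough order of PASCAL training images
num_classes = 20                # rough number of object classes
examples_per_class = pascal_images / num_classes
print(examples_per_class)       # a few hundred examples per class
```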

It is my observation that humans also receive complex and varied inputs, but: a) they are not expected to perform well on all inputs from day 1; b) some inputs are shown to them repeatedly, many more times than those in, say, PASCAL, and from various viewpoints.

The bad news is that we should build richer learning sets, containing more data for each object. We should also think about the ORDER in which the examples in the training set are handed to the learner, so as to avoid biasing or over-fitting. The good news is that we can make less of an effort to annotate each object at the pixel level; my intuition is that bounding boxes probably suffice.

How much training is required to achieve the ability of the human visual system? Again, I look at human beings to give an upper bound. Assume a human of age 5 already has a mature ability to understand the world visually. Say the visual system trains and receives new data about half of the time (the rest is dedicated to, e.g., sleeping). Multiply this by the maximal amount of data that passes through the optic nerve. Given the right hardware (e.g., a human brain), this is more than enough. How much information is stored in the brain? How is it represented? What is kept, and what is thrown away?
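The multiplication suggested above can be carried out, very roughly. All figures here are loose assumptions: ~10 Mbit/s per optic nerve is one published order-of-magnitude estimate (not a settled number), and "half the time" is the guess from the text.

```python
# Rough upper bound on the visual "training data" a five-year-old has seen.
# Every number below is an assumption, good to an order of magnitude at best.
bits_per_second = 10e6            # assumed optic nerve bandwidth, per eye
seconds_per_year = 365 * 24 * 3600
years = 5
fraction_awake = 0.5              # visual system "training" half the time
total_bits = 2 * bits_per_second * seconds_per_year * years * fraction_awake
total_terabytes = total_bits / 8 / 1e12
print(f"{total_terabytes:.0f} TB")  # on the order of hundreds of terabytes
```

Even at this crude level, the result dwarfs any labeled dataset of the PASCAL era by many orders of magnitude, which is the point of the comparison.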