You probably don't think about it as you go about your daily routine, but you never see the same image twice. That is, the same pattern of light, shadow, and color never falls on your retina in exactly the same way. When you're talking to a friend and looking at their face, you have the subjective perception of stability and constancy. In reality, your eyes are flitting all over the place. A rapid eye movement from one location to another is called a saccade. The brief pause between saccades, when the eye sits still, is called a fixation. The average person makes about three saccades per second.
Here's a famous image from a study by Yarbus from 1967:
It shows a picture of a woman's face, along with the eye-tracking pattern of someone who looked at it. When we see a face, we flick our eyes across it constantly, though not at random. You can see that we tend to concentrate on prominent features like the eyes, nose, and mouth, which are rich in information that helps us distinguish the face from others. You don't spend a lot of time looking at cheeks, because they don't give you as much information as the eyes do.
And while your eyes are constantly moving, most interesting objects, like people, are also in motion. So are your head and body. The successive images on your retina change very rapidly all the time, and yet when you stand in front of a painting in a gallery, it doesn't feel like you're viewing a movie with cuts every third of a second. But that's what's really going on.
So if what we're seeing is a rapidly changing movie, even when we're looking at a stationary object like an apple, how the heck do we learn that all these rapidly changing images relate to the same object?
One idea is that our brains learn to cluster together things that appear close together in space and close together in time. Even though your eyes are flitting all over the apple, your brain is keeping track of how far your eyes are moving. If they stay within a small region, and the successive images occur close together in time, it's a good bet that the thing you're looking at is a coherent entity.
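To make that bet concrete, here is a minimal Python sketch that groups a stream of fixations whenever each one is close to the previous one in both time and space. Everything in it is an assumption for illustration: the Fixation record, the group_fixations function, the two thresholds, and the made-up data. It is a toy, not a model of what the brain actually computes.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Fixation:
    t: float  # time in seconds
    x: float  # horizontal gaze position, in degrees of visual angle
    y: float  # vertical gaze position, in degrees of visual angle


def group_fixations(fixations, max_gap_s=0.5, max_dist_deg=5.0):
    """Group successive fixations that are close in time AND in space.

    Both thresholds are arbitrary illustrative values.
    """
    groups = []
    for f in sorted(fixations, key=lambda f: f.t):
        if groups:
            prev = groups[-1][-1]
            close_in_time = (f.t - prev.t) <= max_gap_s
            close_in_space = hypot(f.x - prev.x, f.y - prev.y) <= max_dist_deg
            if close_in_time and close_in_space:
                groups[-1].append(f)  # probably still the same object
                continue
        groups.append([f])  # too far away or too late: start a new group
    return groups


# Hypothetical data: three fixations on one object, then a big jump to another.
fixations = [
    Fixation(0.00, 1.0, 1.0),
    Fixation(0.33, 2.0, 1.5),
    Fixation(0.66, 1.5, 0.5),
    Fixation(1.00, 20.0, 10.0),
    Fixation(1.33, 21.0, 10.5),
]

groups = group_fixations(fixations)
print([len(g) for g in groups])  # [3, 2]
```

The fourth fixation arrives on schedule but lands far from the third, so it starts a new group. Timing alone wouldn't have separated the two objects here; the spatial check does the work.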
The same goes for images that vary even more dramatically over time. Think of something like an elephant. You recognize it from far away and up close, even though the image falling on your retina may be many times smaller or larger. You recognize it when it's rotated, even though it may look very different from the front and from the back. You also recognize it when it's upside down. You recognize it if it's painted green, on a cloudy day or a sunny day. Computers fail miserably at learning to recognize objects with this degree of variance. So how the heck do you do it?
Dileep George calls the method by which we group successive inputs into one representation temporal pooling. The paper by Cox et al. that I'll be referring to here discusses the temporal contiguity of input as a way of grouping it together. The idea is simple: Things that occur close together in time are probably closely related, that is, they probably belong to a single object or concept.
Cox et al. did a clever experiment. They used the artificial objects shown here:
The objects are similar, but they are distinct. The researchers had subjects look at a cross in the middle of a computer screen. Then they showed an object, like object A, to either the right or left of the cross, in the subject's peripheral vision. The subject would naturally saccade to the object. Here's a figure showing what a typical trial would be like:
Now, in another condition, the first three steps would be the same. The subject looks at the cross, object A appears to one side, and the subject saccades to fixate on the object. But in this condition, in the very short time it took the subject to saccade, they switched the object from A to A'. When you make a saccade, you're temporarily blind, so the subject is not even aware that the objects have been switched. Here's what that kind of trial looks like:
After a number of such trials, they tested the subjects with a "same/different" paradigm. This means they basically showed them A and A' together and asked: are these the same object, or are they different objects? Subjects from the normal condition found it easier to distinguish between A and A', while the subjects from the swapped condition were more likely to say that A and A' were the same object.
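Here is a toy Python sketch of why the swap might matter if learning follows temporal contiguity. It just counts how often two different images occur back to back within a trial. The function names, the trial encoding (the image before the saccade, then the image after it), and the trial count are all assumptions for illustration. This is not the analysis from Cox et al., and it is not Dileep George's algorithm.

```python
from collections import defaultdict


def learn_associations(trials):
    """Count how often two different images occur back to back in a trial."""
    strength = defaultdict(int)
    for trial in trials:
        for before, after in zip(trial, trial[1:]):
            if before != after:
                strength[frozenset((before, after))] += 1
    return strength


def association(strength, a, b):
    """Return the learned association between images a and b (0 if none)."""
    return strength.get(frozenset((a, b)), 0)


n_trials = 50  # arbitrary illustrative number

# Each trial is [image before the saccade, image after the saccade].
normal = [["A", "A"]] * n_trials    # same object before and after
swapped = [["A", "A'"]] * n_trials  # object replaced during the saccade

normal_strength = learn_associations(normal)
swapped_strength = learn_associations(swapped)

print(association(normal_strength, "A", "A'"))   # 0
print(association(swapped_strength, "A", "A'"))  # 50
```

In the swapped condition, A and A' pile up a strong association because one always follows the other within a fraction of a second. A learner that treats strongly associated images as views of one object would then be more inclined to call A and A' "the same."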
So what does this suggest? That we're more likely to group together images that occur in rapid succession. This type of work is very closely related to the modeling I'm doing for my dissertation, and I'm going even further and suggesting that clustering input based on spatial and temporal contiguity is the fundamental mechanism by which we learn not just visual objects, but music, language, tactile input, and on and on.
Cox et al.'s experiment is a very clever, very nice way of demonstrating how the principle works in the visual domain.
David D. Cox, Philip Meier, Nadja Oertelt, James J. DiCarlo (2005). 'Breaking' position-invariant object recognition. Nature Neuroscience, 8 (9), 1145-1147. DOI: 10.1038/nn1519