M.R. Bauer Foundation, 1995 Summary Report

Edward Adelson, Ph.D.


Professor of Psychology
MIT Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology
May 4, 1995

New Direction in Vision and Video

"Compression" refers to techniques that reduce the amount of data required to represent an image or sound. Entire families of compression techniques have been developed over the years to overcome particular limitations in storage and transmission. The most important of these limitations (from a commercial perspective) are the limited storage capacity of computer disk systems and the difficulty of transmitting complex digital representations across lines of limited bandwidth, e.g., voice-grade phone lines. One of the greatest challenges for image compression is to transmit a sequence of full-motion digitized images at a high enough rate (frames/second) for the receiver to reconstruct the sequence in real time, with the same motion characteristics as the original.

Adelson divides image compression techniques into three families (low-level, mid-level, and high-level), by analogy with the three levels that researchers use to distinguish stages of neural and cognitive processing in human vision. At present, virtually all techniques in regular use exploit low-level regularities (redundancies) present in all images. These low-level techniques include the algorithms behind formats such as TIFF and JPEG. The commercial importance of pyramidal compression schemes, such as those used in Kodak computer imagery, is beyond question. Most current image coding systems rely on signal processing concepts such as transforms, vector quantization (VQ), and motion compensation (predicting each frame from a displaced copy of a neighboring frame). To achieve significantly lower bit rates (higher levels of compression), it will be necessary to devise encoding schemes that draw on mid-level and high-level computer vision. Model-based systems have been described, but these are usually restricted to some special class of images, such as head-and-shoulders sequences.

The most sophisticated compression schemes imaginable for full-motion images would resemble the high-level, cognitive processing associated with human vision. For example, imagine a video sequence depicting a person who repeatedly opens and closes his fist. Instead of transmitting this sequence pixel by pixel, a high-level system would transmit the first image in the sequence and then a descriptive tag formally equivalent to "person opens and closes fist." Note that formal equivalence does not require linguistic equivalence or, for that matter, that the tag be coded in natural language at all. The receiver would decode the semantic instruction and apply it to the first image, thereby reconstructing this aspect of the entire sequence. Adelson notes that at present it is impossible to use such high-level compression techniques with any degree of fidelity, because we lack the language and interpretive structures that would make such strategies work. However, it is possible to make real progress on compression schemes that operate at an intermediate level by exploiting mid-level regularities in images, including depth information and properties of surfaces. Schemes that exploit surface and depth properties should be able to achieve far greater compression than low-level algorithms alone.

Adelson's research focuses on image sequences depicting simple but real (not "toy") scenes. He treats such a sequence as a three-dimensional volume with dimensions x, y, and t (time). Motion analysis then amounts to orientation-selective filtering within this volume, because a moving feature traces an oriented path through x-y-t space. Standard approaches to motion analysis assume that the optic flow is smooth; such techniques have trouble dealing with occlusion boundaries. Note that occlusion may momentarily remove an object from view, but an effective compression scheme must continue to represent that object, so that when it reappears the scheme treats it as the same entity as before.
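
The link between motion and orientation in the x-t plane can be illustrated with a small sketch. It is not Adelson's filter bank: it swaps in a simple gradient-based least-squares estimate, which recovers the slope of a moving pattern's path through space-time. The function names, the Gaussian test pattern, and all parameter values are assumptions made for illustration.

```python
import math

def make_sequence(width, frames, velocity, center=20.0, sigma=6.0):
    """Build volume[t][x]: a 1-D Gaussian blob translating at `velocity` pixels/frame."""
    return [
        [math.exp(-((x - center - velocity * t) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for t in range(frames)
    ]

def estimate_velocity(volume):
    """Least-squares solution of I_x * v + I_t = 0 over the interior of the x-t volume."""
    num = 0.0
    den = 0.0
    for t in range(1, len(volume) - 1):
        for x in range(1, len(volume[t]) - 1):
            ix = (volume[t][x + 1] - volume[t][x - 1]) / 2.0  # spatial derivative
            it = (volume[t + 1][x] - volume[t - 1][x]) / 2.0  # temporal derivative
            num += ix * it
            den += ix * ix
    return -num / den

# Hypothetical test sequence: a blob drifting right at one pixel per frame.
volume = make_sequence(width=64, frames=9, velocity=1.0)
v_hat = estimate_velocity(volume)
print(v_hat)
```

The estimate assumes a single smooth motion across the whole volume, which is exactly the assumption that fails at occlusion boundaries.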

The most popular solution to the occlusion problem is to allow discontinuities in the flow field, imposing the smoothness constraint in a piecewise fashion. But there is a sense in which the discontinuities in flow are artifactual, resulting from the attempt to capture the motion of multiple overlapping objects in a single flow field. Adelson therefore decomposes the image sequence into a set of overlapping layers, where each layer's motion is described by a smooth flow field. The discontinuities in the description are then attributed to object opacities rather than to the flow itself, mirroring the structure of the scene.

Adelson has been using mid-level vision concepts to achieve a decomposition that can be applied to many kinds of image material. He described a coding scheme based on a set of overlapping layers, in which a scene is automatically segmented into layers, much as the human visual system is believed to do. The layers, which are ordered in depth and move over one another, are then composited in the manner of traditional animation, as practiced by Walt Disney Studios and others.
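
A minimal sketch of the layered idea follows, assuming one-dimensional "images", integer translations as the only motion model, and a per-pixel opacity (alpha) map for each layer. The dictionary layout and the names `shift` and `render_frame` are hypothetical and are not taken from Adelson's system. Each frame is rebuilt by moving every layer and compositing back to front.

```python
def shift(row, dx, fill=0.0):
    """Translate a 1-D row by an integer dx, padding uncovered pixels with `fill`."""
    n = len(row)
    return [row[x - dx] if 0 <= x - dx < n else fill for x in range(n)]

def render_frame(layers, t):
    """Composite layers back to front at time t.

    Each layer is a dict with 'intensity' and 'alpha' (lists of floats)
    and 'velocity' (integer pixels per frame). layers[0] is the farthest.
    """
    frame = [0.0] * len(layers[0]["intensity"])
    for layer in layers:
        dx = layer["velocity"] * t
        intensity = shift(layer["intensity"], dx)
        alpha = shift(layer["alpha"], dx)
        # Opaque pixels (alpha = 1) hide what is behind; transparent ones pass it through.
        frame = [a * i + (1.0 - a) * f for a, i, f in zip(alpha, intensity, frame)]
    return frame

# Hypothetical scene: a static background and a two-pixel object moving right.
background = {"intensity": [0.2] * 8, "alpha": [1.0] * 8, "velocity": 0}
foreground = {"intensity": [0.9] * 8,
              "alpha": [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
              "velocity": 1}
frames = [render_frame([background, foreground], t) for t in range(3)]
for frame in frames:
    print(frame)
```

Here the whole sequence is described by two stored rows, two opacity maps, and two velocities, and the sharp boundary in each rendered frame comes from the opacity map rather than from any discontinuity in a layer's motion.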

Based on these ideas, Adelson demonstrated a set of techniques for segmenting images into coherently moving regions using fine-grained motion analysis and clustering. This allowed him to decompose an image into a set of layers, along with information about occlusion and depth ordering. Adelson applied the techniques to the "flower garden" sequence (an industry-wide standard set of images that serves as a benchmark for compression work). The analysis decomposed the flower garden scene into four layers and represented the entire 30-frame sequence with a single image of each layer, along with associated motion parameters. The next step is to develop early and mid-level vision mechanisms that emulate the processing that occurs in the primate visual cortex, and to design algorithms that apply such transformations with high computational efficiency. The candidate cortical mechanisms would be useful for edge detection, texture analysis, motion analysis, and image enhancement (e.g., deconvolution to eliminate blurring, contrast enhancement, and spatial frequency enhancement).
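
The report does not say which clustering method was used. As a stand-in, the sketch below applies plain k-means to per-pixel motion vectors, so that pixels sharing a motion receive the same label and form a candidate layer. The flow values, the initial centers, and the function name `kmeans` are all assumptions made for illustration.

```python
def kmeans(vectors, centers, iterations=10):
    """Plain k-means on 2-D motion vectors; returns (labels, centers)."""
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest center.
        for i, (vx, vy) in enumerate(vectors):
            labels[i] = min(
                range(len(centers)),
                key=lambda k: (vx - centers[k][0]) ** 2 + (vy - centers[k][1]) ** 2,
            )
        # Update step: move each center to the mean of its members.
        updated = []
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                updated.append((sum(v[0] for v in members) / len(members),
                                sum(v[1] for v in members) / len(members)))
            else:
                updated.append(centers[k])  # keep an empty cluster where it was
        centers = updated
    return labels, centers

# Hypothetical flow field: three fast-moving (near) pixels and three slow (far) ones.
flow = [(4.1, 0.0), (3.9, 0.1), (4.0, -0.1), (1.0, 0.0), (0.9, 0.1), (1.1, -0.1)]
labels, centers = kmeans(flow, centers=[(5.0, 0.0), (0.0, 0.0)])
print(labels, centers)
```

In a full system each cluster would then supply the support map for one layer, and the cluster's motion would become that layer's motion parameters.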

Two domains being explored are charting football plays and extracting choreography from a ballet sequence. These description schemes were demonstrated during Adelson's talk by means of videos in which real-life motion sequences were first compressed and then successfully decompressed.

 


 
