"Compression"
refers to techniques that reduce the amount of storage
required to represent an image or sound. Entire families
of compression techniques have been developed over the
years in order to overcome particular limitations in
central storage and transmission domains. The most important
of these domains (from a commercial perspective) involve
the limited storage capacity of computer disk systems and
the difficulty of transmitting complex digital representations
across transmission lines of limited bandwidth, e.g.
voice-grade phone lines. One of the greatest challenges
to image compression techniques is to transmit a sequence
of full-motion digitized images at a sufficiently high
rate (frames/second) to enable the receiver to reconstitute
the image in real time and with the same motion characteristics
as were present in the original.
Adelson divides
image compression techniques into three families: low-level,
mid-level and high-level, in analogy with the three
domains used by researchers to distinguish various levels
of neural and cognitive processing in human vision.
At present, virtually all techniques in regular use exploit
low-level regularities (redundancies) present in all
images. These low-level compression techniques include
computer algorithms such as those represented by TIFF
and JPEG formats. The commercial importance of pyramidal
compression schemes, such as those used by Kodak computer
imagery, is beyond question. Most current image coding
systems rely on signal processing concepts such as transforms,
vector quantization (VQ), and motion compensation (predicting
each frame from its neighbors along estimated motion). In order
to achieve significantly lower bit rates (higher levels
of compression), it will be necessary to devise encoding
schemes that involve mid-level and high-level computer
vision. Model-based systems have been described, but
these are usually restricted to some special class of
images such as head-and-shoulders sequences.
The most
sophisticated imaginable compression schemes, for full
motion images, would resemble the high-level, cognitive
processing associated with human vision. For example,
imagine that one has a video image sequence depicting
a person who repeatedly opens and closes his fist. Instead
of transmitting this sequence of images pixel by pixel,
a high-level system would transmit the first image in
the sequence and then a descriptive tag formally equivalent
to "person opens and closes fist." Note that formal
equivalence does not require linguistic equivalence
or, for that matter, even that the tag be coded in natural
language terms. The receiver of this transmission would
decode the semantic instruction and apply it appropriately
to the first image, thereby reconstituting this aspect
of the entire sequence. Adelson notes that at present
it is impossible to use such high-level compression
techniques with any degree of fidelity (we lack the
proper language and interpretive structures that would
enable such strategies to work). However, it is possible
to make real progress on compression schemes that operate
at an intermediate level by exploiting mid-level regularities
in images, including depth information and properties
of surfaces. Compression schemes that exploit surface
and depth properties of images should be able to achieve
far greater compression than currently achievable by
use of low-level algorithms alone.
Adelson's
research focuses on image sequences depicting simple
but real (not "toy") scenes. He treats
such sequences as a three-dimensional volume, with the
dimensions of x, y, and t (time). Motion analysis involves
orientation-selective filtering within this volume.
Standard approaches to motion analysis assume that the
optic flow is smooth; such techniques have trouble dealing
with occlusion boundaries. Note that occlusion may momentarily
remove an object from the scene, but an effective compression
scheme must continue to represent that object so that
when the object is no longer occluded the scheme will
treat that object as the same entity as before the disappearance.
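To illustrate the x, y, t volume described above: a pattern
moving at constant speed traces a slanted streak through an
x-t slice of the volume, and the slant of that streak is the
velocity. The sketch below builds such a slice from assumed
toy data (a Gaussian blob with a hypothetical speed) and
recovers the speed with a least-squares gradient estimate.
The gradient estimate is a simple stand-in chosen for brevity;
it is not the orientation-selective filtering that the text
attributes to Adelson, and all names in it are illustrative.

```python
import math

# Assumed toy data: a Gaussian blob drifting right at constant speed.
TRUE_SPEED = 0.5           # pixels per frame (hypothetical value)
WIDTH, FRAMES = 64, 8      # size of the x-t slice (hypothetical)


def intensity(x, t):
    """Brightness of the drifting blob at position x in frame t."""
    center = 20.0 + TRUE_SPEED * t
    return math.exp(-((x - center) ** 2) / (2 * 4.0 ** 2))


# One x-t slice of the (x, y, t) volume, indexed as xt[t][x].
xt = [[intensity(x, t) for x in range(WIDTH)] for t in range(FRAMES)]


def estimate_speed(xt):
    """Least-squares solution of Ix * v + It = 0 over the slice.

    Ix and It are central differences along x and t. The ratio
    -sum(Ix * It) / sum(Ix * Ix) is the slope of the streak.
    """
    num = 0.0
    den = 0.0
    for t in range(1, len(xt) - 1):
        for x in range(1, len(xt[0]) - 1):
            ix = (xt[t][x + 1] - xt[t][x - 1]) / 2.0
            it = (xt[t + 1][x] - xt[t - 1][x]) / 2.0
            num += ix * it
            den += ix * ix
    return -num / den


speed = estimate_speed(xt)  # close to TRUE_SPEED for this smooth blob
```

A single estimate of this kind assumes one smooth motion over
the whole slice, which is exactly the assumption that fails at
occlusion boundaries.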
The most
popular solution to the occlusion problem is to allow
discontinuities in the flow field, imposing the smoothness
constraint in a piecewise fashion. But there is a sense
in which the discontinuities in flow are artifactual,
resulting from the attempt to capture the motion of
multiple overlapping objects in a single flow field.
So Adelson decomposes the image sequence into a set
of overlapping layers, where each layer's motion is
described by a smooth flow field. The discontinuities
in the description are then attributed to object opacities
rather than to the flow itself, mirroring the structure
of the scene.
Adelson has
been using mid-level vision concepts to achieve a decomposition
that can be applied to many domains of image material.
He described a coding scheme based on a set of overlapping
layers, i.e., a scheme in which a scene was automatically
segmented into layers, much as it is believed the human
visual system does. The layers, which are ordered in
depth and move over one another, are then composited
into the final image, much as cels are in the animation
produced by Walt Disney Studios and others.
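The compositing step can be sketched in a few lines. The
example below is a minimal illustration under stated
assumptions, not Adelson's implementation: layers are
one-dimensional rows, each layer carries an intensity map, an
opacity (alpha) map and a constant integer velocity, and a
frame is rendered by painting the layers back to front. The
function names and the sample values are hypothetical.

```python
def shift(row, dx, fill):
    """Translate a list by dx samples (positive = right), padding with fill."""
    n = len(row)
    return [row[i - dx] if 0 <= i - dx < n else fill for i in range(n)]


def composite(layers, t):
    """Render frame t from layers listed back to front.

    Each layer is a tuple (intensity, alpha, velocity). The layer is
    moved by velocity * t samples and then painted over whatever is
    already in the frame, weighted by its opacity.
    """
    n = len(layers[0][0])
    frame = [0.0] * n
    for intensity, alpha, velocity in layers:
        dx = velocity * t
        moved_i = shift(intensity, dx, 0.0)
        moved_a = shift(alpha, dx, 0.0)   # outside the layer: transparent
        frame = [a * i + (1.0 - a) * f
                 for i, a, f in zip(moved_i, moved_a, frame)]
    return frame


# Assumed toy scene: a static opaque background and a small opaque
# foreground patch that moves one sample to the right per frame.
background = ([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
              [1.0] * 8, 0)
foreground = ([99.0] * 8,
              [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 1)

frame0 = composite([background, foreground], 0)
frame2 = composite([background, foreground], 2)
```

In frame2 the background samples that the patch covered in
frame0 are visible again. Nothing discontinuous happens to
either layer's motion; the sharp boundary in the rendered frame
comes entirely from the opacity map.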
Based on
these ideas, Adelson demonstrated a set of techniques
for segmenting images into coherently moving regions
using a fine motion analysis and clustering techniques.
This allowed him to decompose an image into a set of
layers along with information about occlusion and depth
ordering. Adelson applied the techniques to the "flower
garden" sequence (an industry-standard test sequence that
serves as a benchmark for compression work). He analyzed
the flower garden scene into four layers, and represented
the entire 30-frame sequence with a single image of
each layer, along with associated motion parameters.
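As a rough illustration of how clustering can group motion
estimates into coherently moving regions, the sketch below
runs a plain k-means on a handful of two-dimensional motion
vectors. It is a simplified stand-in under stated assumptions:
the text does not specify the motion model or the clustering
algorithm used, and the vectors, starting centers and function
name here are hypothetical.

```python
def kmeans_motion(vectors, centers, iterations=10):
    """Group (dx, dy) motion vectors around the nearest center.

    Returns a label per vector and the final centers. Each center
    is the mean motion of one candidate layer.
    """
    centers = list(centers)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest center by squared distance.
        labels = [
            min(range(len(centers)),
                key=lambda k: (v[0] - centers[k][0]) ** 2
                            + (v[1] - centers[k][1]) ** 2)
            for v in vectors
        ]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centers


# Assumed toy measurements: three nearly static patches and three
# patches moving about 3 samples per frame to the right.
flow = [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0),
        (3.0, 0.1), (2.9, 0.0), (3.1, -0.1)]

labels, centers = kmeans_motion(flow, [(0.0, 0.0), (1.0, 0.0)])
```

Here the labels separate the static patches from the moving
ones, and each center summarizes one group's motion. A real
scene would need a richer motion model per layer and a way to
choose the number of layers.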
The next step is to develop early and mid-level vision
mechanisms that emulate the processing that occurs in
the primate visual cortex, and to design algorithms
that apply such transformations with high computational
efficiency. The candidate cortical mechanisms would
be useful for edge detection, texture analysis, motion
analysis, and image enhancement (i.e., deconvolution
to eliminate blurring, contrast enhancement, and spatial
frequency enhancement).
Two domains
being explored are charting football plays and extracting
choreography from a ballet sequence. These description
schemes were demonstrated during Adelson's talk by means
of videos in which real-life motion sequences were seen
first compressed and then successfully decompressed.