Publications
2007
- "Perceived Object Trajectories During Occlusion Constrain Visual Statistical Learning", József Fiser, Brian J. Scholl, Richard N. Aslin, Psychon Bull Rev. 2007 Feb;14(1):173-178.
Abstract
Visual statistical learning of shape sequences was examined in the context of occluded object trajectories. In a learning phase, participants viewed a sequence of moving shapes whose trajectories and speed profiles elicited either a bouncing or a streaming percept: The sequences consisted of a shape moving toward and then passing behind an occluder, after which two different shapes emerged from behind the occluder. At issue was whether statistical learning linked both object transitions equally, or whether the percept of either bouncing or streaming constrained the association between pre- and postocclusion objects. In familiarity judgments following the learning, participants reliably selected the shape pair that conformed to the bouncing or streaming bias that was present during the learning phase. A follow-up experiment demonstrated that differential eye movements could not account for this finding. These results suggest that sequential statistical learning is constrained by the spatiotemporal perceptual biases that bind two shapes moving through occlusion, and that this constraint thus reduces the computational complexity of visual statistical learning.
2005
- "Methodological Challenges for Understanding Cognitive Development in Infants", Richard N. Aslin and József Fiser, Trends Cogn Sci. 2005 Mar; 9: 92-98.
Abstract
Studies of cognitive development in human infants have relied almost entirely on descriptive data at the behavioral level: the age at which a particular ability emerges. The underlying mechanisms of cognitive development remain largely unknown, despite attempts to correlate behavioral states with brain states. We argue that research on cognitive development must focus on theories of learning, and that these theories must reveal both the computational principles and the set of constraints that underlie developmental change. We discuss four specific issues in infant learning that gain renewed importance in light of this opinion.
- "Encoding Multielement Scenes: Statistical Learning of Visual Feature Hierarchies", József Fiser and Richard N. Aslin, J Exp Psychol Gen. 2005 Nov; 134: 521-537.
Abstract
The authors investigated how human adults encode and remember parts of multielement scenes composed
of recursively embedded visual shape combinations. The authors found that shape combinations
that are parts of larger configurations are less well remembered than shape combinations of the same kind
that are not embedded. Combined with basic mechanisms of statistical learning, this embeddedness
constraint enables the development of complex new features for acquiring internal representations
efficiently without being computationally intractable. The resulting representations also encode parts and
wholes by chunking the visual input into components according to the statistical coherence of their
constituents. These results suggest that a bootstrapping approach of constrained statistical learning offers
a unified framework for investigating the formation of different internal representations in pattern and
scene perception.
2004
- "Small Modulations of Ongoing Cortical Dynamics by Sensory Input During Natural Vision", József Fiser, Chiayu Chiu and Michael Weliky, Nature. 2004 Sep 30; 431:573-578.
Abstract
During vision, it is believed that neural activity in the primary visual cortex is predominantly driven by sensory input from the environment. However, visual cortical neurons respond to repeated presentations of the same stimulus with a high degree of variability. Although this variability has been considered to be noise owing to random spontaneous activity within the cortex, recent studies show that spontaneous activity has a highly coherent spatio-temporal structure. This raises the possibility that the pattern of this spontaneous activity may shape neural responses during natural viewing conditions to a larger extent than previously thought. Here, we examine the relationship between spontaneous activity and the response of primary visual cortical neurons to dynamic natural-scene and random-noise film images in awake, freely viewing ferrets from the time of eye opening to maturity. The correspondence between evoked neural activity and the structure of the input signal was weak in young animals, but systematically improved with age. This improvement was linked to a shift in the dynamics of spontaneous activity. At all ages including the mature animal, correlations in spontaneous neural firing were only slightly modified by visual stimulation, irrespective of the sensory input. These results suggest that in both the developing and mature visual cortex, sensory evoked neural activity represents the modulation and triggering of ongoing circuit dynamics by input signals, rather than directly reflecting the structure of the input signal itself.
2003
- "Contrast Conservation in Human Vision", József Fiser, Peter J. Bex, Walter Makous, Vision Res. 2003 Nov; 43: 2637-2648.
Abstract
Visual experience, which is defined by brief saccadic sampling of complex scenes at high contrast, has typically been studied with static gratings at threshold contrast. To investigate how suprathreshold visual processing is related to threshold vision, we tested the temporal integration of contrast in the presence of large, sudden changes in the stimuli such as occur during saccades under natural conditions. We observed completely different effects under threshold and suprathreshold viewing conditions. The threshold contrast of successively presented gratings that were either perpendicularly oriented or of inverted phase showed probability summation, implying no detectable interaction between independent visual detectors. However, at suprathreshold levels we found complete algebraic summation of contrast for stimuli longer than 53 ms. The same results were obtained during sudden changes between random noise patterns and between natural scenes. These results cannot be explained by traditional contrast gain-control mechanisms or the effect of contrast constancy. Rather, at suprathreshold levels, the visual system seems to conserve the contrast information from recently viewed images, perhaps for the efficient assessment of the contrast of the visual scene while the eye saccades from place to place.
- "Coding of Natural Scenes in Primary Visual Cortex", Michael Weliky, József Fiser, Ruskin H. Hunt, David N. Wagner, Neuron. 2003 Feb 20; 37: 703-718.
Abstract
Natural scene coding in ferret visual cortex was investigated using a new technique for multi-site recording of neuronal activity from the cortical surface. Surface recordings accurately reflected radially aligned layer 2/3 activity. At individual sites, evoked activity to natural scenes was weakly correlated with the local image contrast structure falling within the cells' classical receptive field. However, a population code, derived from activity integrated across cortical sites having retinotopically overlapping receptive fields, correlated strongly with the local image contrast structure. Cell responses demonstrated high lifetime sparseness, population sparseness, and high dispersal values, implying efficient neural coding in terms of information processing. These results indicate that while cells at an individual cortical site do not provide a reliable estimate of the local contrast structure in natural scenes, cell activity integrated across distributed cortical sites is closely related to this structure in the form of a sparse and dispersed code.
2002
- "Statistical Learning of New Visual Feature Combinations by Infants", József Fiser and Richard N. Aslin, Proc Natl Acad Sci U S A 2002 Nov 26; 99:15822-15826.
Abstract
The ability of humans to recognize a nearly unlimited number of unique visual objects must be based on a robust and efficient learning mechanism that extracts complex visual features from the environment. To determine whether statistically optimal representations of scenes are formed during early development, we used a habituation paradigm with 9-month-old infants and found that, by mere observation of multielement scenes, they become sensitive to the underlying statistical structure of those scenes. After exposure to a large number of scenes, infants paid more attention not only to element pairs that cooccurred more often as embedded elements in the scenes than other pairs, but also to pairs that had higher predictability (conditional probability) between the elements of the pair. These findings suggest that, similar to lower-level visual representations, infants learn higher-order visual features based on the statistical coherence of elements within the scenes, thereby allowing them to develop an efficient representation for further associative learning.
- "Statistical Learning of Higher-Order Temporal Structure From Visual Shape Sequences", József Fiser and Richard N. Aslin, J Exp Psychol Learn Mem Cogn. 2002 May; 28: 458-467.
Abstract
In 3 experiments, the authors investigated the ability of observers to extract the probabilities of successive shape co-occurrences during passive viewing. Participants became sensitive to several temporal-order statistics, both rapidly and with no overt task or explicit instructions. Sequences of shapes presented during familiarization were distinguished from novel sequences of familiar shapes, as well as from shape sequences that were seen during familiarization but less frequently than other shape sequences, demonstrating at least the extraction of joint probabilities of 2 consecutive shapes. When joint probabilities did not differ, another higher-order statistic (conditional probability) was automatically computed, thereby allowing participants to predict the temporal order of shapes. Results of a single-shape test documented that lower-order statistics were retained during the extraction of higher-order statistics. These results suggest that observers automatically extract multiple statistics of temporal events that are suitable for efficient associative learning of new temporal features.
- "Eye-grabbing Insights", Bruce Bower, Science News, Nov. 9, 2002; 162, 19: 293.
2001
- "To What Extent Can Matching Algorithms Based on Direct Outputs of Low Level Generic Descriptors Account for Human Object Recognition?", József Fiser and Eric E. Cooper, 2001 March 2; 1-35.
Abstract
A number of recent successful models of face recognition posit only two layers, an
input layer consisting of a lattice of spatial filters and a single subsequent stage by which those
descriptor values are mapped directly onto an object representation layer by standard matching
methods such as stochastic optimization. Is this approach sufficient for modeling human object
recognition? We tested whether a highly efficient version of such a two-layer model would
manifest effects similar to those shown by humans when given the task of recognizing images of
objects that had been employed in a series of psychophysical experiments. System accuracy was
quite high overall, but was qualitatively different from that evidenced by humans in object
recognition tasks. The discrepancy between the system's performance and human performance is
likely to be revealed by all models that map filter values directly onto object units. These results
suggest that human object recognition (as opposed to face recognition) may be difficult to
approximate by models that do not posit hidden units for explicit representation of intermediate
entities such as edges, viewpoint invariant classifiers, axes, shocks and/or object parts.
- "Unsupervised Statistical Learning of Higher-Order Spatial Structures from Visual Scenes", József Fiser and Richard N. Aslin, Psychological Science, November 2001; 12: 499-504.
Abstract
Three experiments investigated the ability of human observers to extract the joint and conditional probabilities of shape co-occurrences during passive viewing of complex visual scenes. Results indicated that statistical learning of shape conjunctions was both rapid and automatic, as subjects were not instructed to attend to any particular features of the displays. Moreover, in addition to single-shape frequency, subjects acquired in parallel several different higher-order aspects of the statistical structure of the displays, including absolute shape-position relations in an array, shape-pair arrangements independent of position, and conditional probabilities of shape co-occurrences. Unsupervised learning of these higher-order statistics provides support for Barlow's theory of visual recognition, which posits that detecting "suspicious coincidences" of elements during recognition is a necessary prerequisite for efficient learning of new visual features.
- "Size Tuning in the Absence of Spatial Frequency Tuning in Object Recognition", József Fiser, Suresh Subramaniam, Irving Biederman, Vision Research, 2001; 41, 1931-1950.
Abstract
How do we attend to objects at a variety of sizes as we view our visual world? Because of an advantage in identification of
lowpass over highpass filtered patterns, as well as large over small images, a number of theorists have assumed that
size-independent recognition is achieved by spatial frequency (SF) based coarse-to-fine tuning. We found that the advantage of
large sizes or low SFs was lost when participants attempted to identify a target object (specified verbally) somewhere in the middle
of a sequence of 40 images of objects, each shown for only 72 ms, as long as the target and distractors were the same size or
spatial frequency (unfiltered or low or high bandpassed). When targets were of a different size or scale than the distractors, a
marked advantage (pop out) was observed for large (unfiltered) and low SF targets against small (unfiltered) and high SF
distractors, respectively, and a marked decrement for the complementary conditions. Importantly, this pattern of results for large
and small images was unaffected by holding absolute or relative SF content constant over the different sizes and it could not be
explained by simple luminance- or contrast-based pattern masking. These results suggest that size/scale tuning in object
recognition was accomplished over the first several images (576 ms) in the sequence and that the size tuning was implemented
by a mechanism sensitive to spatial extent rather than to variations in spatial frequency.
- "Invariance of Long-term Visual Priming to Scale, Reflection, Translation, and Hemisphere", József Fiser and Irving Biederman, Vision Research, 2001; 41, 221-234.
Abstract
The representation of shape mediating visual object priming was investigated. In two blocks of trials, subjects named images
of common objects presented for 185 ms that were bandpass filtered, either at high (10 cpd) or at low (2 cpd) center frequency
with a 1.5 octave bandwidth, and positioned either 5° right or left of fixation. The second presentation of an image of a given
object type could be filtered at the same or different band, be shown at the same or translated (and mirror reflected) position, and
be the same exemplar as that in the first block or a same-name different-shaped exemplar (e.g. a different kind of chair). Second
block reaction times (RTs) and error rates were markedly lower than they were on the first block, which, in the context of prior
results, was indicative of strong priming. A change of exemplar in the second block resulted in a significant cost in RTs and error
rates, indicating that a portion of the priming was visual and not just verbal or basic-level conceptual. However, a change in the
spatial frequency (SF) content of the image had no effect on priming despite the dramatic difference it made in appearance of the
objects. This invariance to SF changes was also preserved with centrally presented images in a second experiment. Priming was
also invariant to a change in left–right position (and mirror orientation) of the image. The invariance over translation of such a
large magnitude suggests that the locus of the representation mediating the priming is beyond an area that would be homologous
to posterior TEO in the monkey. We conclude that this representation is insensitive to low level image variations (e.g. SF, precise
position or orientation of features) that do not alter the basic part-structure of the object. Finally, recognition performance was
unaffected by whether low or high bandpassed images were presented either in the left or right visual field, giving no support to
the hypothesis of hemispheric differences in processing low and high spatial frequencies.
- "Size Invariance in Visual Object Priming of Gray Scale Images", József Fiser and Irving Biederman, Perception, March 2, 2001.
Abstract
The strength of visual priming of briefly presented gray scale pictures of real world
objects, measured by naming reaction times and errors, was independent of whether the primed
picture of the object was presented in the same or different size than the original picture. These
findings replicate Biederman & Cooper’s (1992) results on size invariance in shape recognition,
which were obtained with line drawings, and extend them to the domain of gray level images.
Entry-level shape identification is based either predominantly on scale-invariant representations
incorporating orientation and depth discontinuities which are well captured by line drawings, or
both discontinuities and the representation derived from smooth gradual surface changes are scale invariant.
2000
- "Experience-dependent Visual Cue Integration Based on Consistencies Between Visual & Haptic Percepts", Joseph E. Atkins, József Fiser and Robert A. Jacobs, Vision Research, 2001; 41, 449-461.
Abstract
We study the hypothesis that observers can use haptic percepts as a standard against which the relative reliabilities of visual cues
can be judged, and that these reliabilities determine how observers combine depth information provided by these cues. Using a
novel visuo-haptic virtual reality environment, subjects viewed and grasped virtual objects. In Experiment 1, subjects were trained
under motion relevant conditions, during which haptic and visual motion cues were consistent whereas haptic and visual texture
cues were uncorrelated, and texture relevant conditions, during which haptic and texture cues were consistent whereas haptic and
motion cues were uncorrelated. Subjects relied more on the motion cue after motion relevant training than after texture relevant
training, and more on the texture cue after texture relevant training than after motion relevant training. Experiment 2 studied
whether or not subjects could adapt their visual cue combination strategies in a context-dependent manner based on context-dependent
consistencies between haptic and visual cues. Subjects successfully learned two cue combination strategies in parallel, and
correctly applied each strategy in its appropriate context. Experiment 3, which was similar to Experiment 1 except that it used a
more naturalistic experimental task, yielded the same pattern of results as Experiment 1 indicating that the findings do not depend
on the precise nature of the experimental task. Overall, the results suggest that observers can involuntarily compare visual and
haptic percepts in order to evaluate the relative reliabilities of visual cues, and that these reliabilities determine how cues are
combined during three-dimensional visual perception.
- "Minimizing Binding Errors Using Learned Conjunctive Features", Bartlett W. Mel and József Fiser, Neural Computation, 12, 247-278.
Abstract
We have studied some of the design trade-offs governing visual representations
based on spatially invariant conjunctive feature detectors, with an
emphasis on the susceptibility of such systems to false-positive recognition
errors (Malsburg’s classical binding problem). We begin by deriving
an analytical model that makes explicit how recognition performance is
affected by the number of objects that must be distinguished, the number
of features included in the representation, the complexity of individual
objects, and the clutter load, that is, the amount of visual material in the
field of view in which multiple objects must be simultaneously recognized,
independent of pose, and without explicit segmentation. Using the
domain of text to model object recognition in cluttered scenes, we show
that with corrections for the nonuniform probability and nonindependence
of text features, the analytical model achieves good fits to measured
recognition rates in simulations involving a wide range of clutter loads,
word sizes, and feature counts. We then introduce a greedy algorithm for
feature learning, derived from the analytical model, which grows a representation
by choosing those conjunctive features that are most likely
to distinguish objects from the cluttered backgrounds in which they are
embedded. We show that the representations produced by this algorithm
are compact, decorrelated, and heavily weighted toward features of low
conjunctive order. Our results provide a more quantitative basis for understanding
when spatially invariant conjunctive features can support
unambiguous perception in multiobject scenes, and lead to several insights
regarding the properties of visual representations optimized for
specific recognition tasks.
1999
- "Subordinate-level Object Classification Reexamined", Irving Biederman, Suresh Subramaniam, Moshe Bar, Peter Kalocsai, József Fiser, Psychological Research, 1999; 62: 131-153.
Abstract
The classification of a table as round rather than square, a car as a Mazda rather than a Ford, a drill bit as 3/8-inch rather than 1/4-inch, and a face as Tom have all been regarded as a single process termed "subordinate classification." Despite the common label, the considerable heterogeneity of the perceptual processing required to achieve such classifications requires, minimally, a more detailed taxonomy. Perceptual information relevant to subordinate-level shape classifications can be presumed to vary on continua of (a) the type of distinctive information that is present, nonaccidental or metric, (b) the size of the relevant contours or surfaces, and (c) the similarity of the to-be-discriminated features, such as whether a straight contour has to be distinguished from a contour of low curvature versus high curvature. We consider three, relatively pure cases. Case 1 subordinates may be distinguished by a representation, a geon structural description (GSD), specifying a nonaccidental characterization of an object's large parts and the relations among these parts, such as a round table versus a square table. Case 2 subordinates are also distinguished by GSDs, except that the distinctive GSDs are present at a small scale in a complex object so the location and mapping of the GSDs are contingent on an initial basic-level classification, such as when we use a logo to distinguish various makes of cars. Expertise for Cases 1 and 2 can be easily achieved through specification, often verbal, of the GSDs. Case 3 subordinates, which have furnished much of the grist for theorizing with "view-based" template models, require fine metric discriminations. Cases 1 and 2 account for the overwhelming majority of shape-based basic- and subordinate-level object classifications that people can and do make in their everyday lives. These classifications are typically made quickly, accurately, and with only modest costs of viewpoint changes. Whereas the activation of an array of multiscale, multiorientation filters, presumed to be at the initial stage of all shape processing, may suffice for determining the similarity of the representations mediating recognition among Case 3 subordinate stimuli (and faces), Cases 1 and 2 require that the output of these filters be mapped to classifiers that make explicit the nonaccidental properties, parts, and relations specified by the GSDs.
1998
- "Distance Modulation of Neural Activity in the Visual Cortex", Allan C. Dobbins, Richard M. Jeo, József Fiser, John M. Allman, Science, 24 July 1998; 281 (5376):552-555.
Abstract
Humans use distance information to scale the size of objects. Earlier studies demonstrated changes in neural response as a function of gaze direction and gaze distance in the dorsal visual cortical pathway to parietal cortex. These findings have been interpreted as evidence of the parietal pathway's role in spatial representation. Here, distance-dependent changes in neural response were also found to be common in neurons in the ventral pathway leading to inferotemporal cortex of monkeys. This result implies that the information necessary for object and spatial scaling is common to all visual cortical areas.