Publications

2007 | 2005 | 2004 | 2003 | 2002 | 2001 | 2000 | 1999 | 1998
2007

  • "Perceived Object Trajectories During Occlusion Constrain Visual Statistical Learning", József Fiser, Brian J. Scholl, Richard N. Aslin, Psychon Bull Rev. 2007 Feb;14(1):173-178.
    Abstract / .PDF

    Abstract
    Visual statistical learning of shape sequences was examined in the context of occluded object trajectories. In a learning phase, participants viewed a sequence of moving shapes whose trajectories and speed profiles elicited either a bouncing or a streaming percept: The sequences consisted of a shape moving toward and then passing behind an occluder, after which two different shapes emerged from behind the occluder. At issue was whether statistical learning linked both object transitions equally, or whether the percept of either bouncing or streaming constrained the association between pre- and postocclusion objects. In familiarity judgments following the learning, participants reliably selected the shape pair that conformed to the bouncing or streaming bias that was present during the learning phase. A follow-up experiment demonstrated that differential eye movements could not account for this finding. These results suggest that sequential statistical learning is constrained by the spatiotemporal perceptual biases that bind two shapes moving through occlusion, and that this constraint thus reduces the computational complexity of visual statistical learning.

2005
  • "Methodological Challenges for Understanding Cognitive Development in Infants", Richard N. Aslin and József Fiser, Trends Cogn Sci. 2005 Mar; 9: 92-98.
    Abstract / .PDF

    Abstract
    Studies of cognitive development in human infants have relied almost entirely on descriptive data at the behavioral level - the age at which a particular ability emerges. The underlying mechanisms of cognitive development remain largely unknown, despite attempts to correlate behavioral states with brain states. We argue that research on cognitive development must focus on theories of learning, and that these theories must reveal both the computational principles and the set of constraints that underlie developmental change. We discuss four specific issues in infant learning that gain renewed importance in light of this opinion.

  • "Encoding Multielement Scenes: Statistical Learning of Visual Feature Hierarchies", József Fiser and Richard N. Aslin, J Exp Psychol Gen. 2005 Nov; 134: 521-537.
    Abstract / .PDF

    Abstract
    The authors investigated how human adults encode and remember parts of multielement scenes composed of recursively embedded visual shape combinations. The authors found that shape combinations that are parts of larger configurations are less well remembered than shape combinations of the same kind that are not embedded. Combined with basic mechanisms of statistical learning, this embeddedness constraint enables the development of complex new features for acquiring internal representations efficiently without being computationally intractable. The resulting representations also encode parts and wholes by chunking the visual input into components according to the statistical coherence of their constituents. These results suggest that a bootstrapping approach of constrained statistical learning offers a unified framework for investigating the formation of different internal representations in pattern and scene perception.

2004

  • "Small Modulations of Ongoing Cortical Dynamics by Sensory Input During Natural Vision", József Fiser, Chiayu Chiu and Michael Weliky, Nature. 2004 Sep 30; 431:573-578.
    Abstract / .PDF / Supplementary / Commentary / Commentary

    Abstract

    During vision, it is believed that neural activity in the primary visual cortex is predominantly driven by sensory input from the environment. However, visual cortical neurons respond to repeated presentations of the same stimulus with a high degree of variability. Although this variability has been considered to be noise owing to random spontaneous activity within the cortex, recent studies show that spontaneous activity has a highly coherent spatio-temporal structure. This raises the possibility that the pattern of this spontaneous activity may shape neural responses during natural viewing conditions to a larger extent than previously thought. Here, we examine the relationship between spontaneous activity and the response of primary visual cortical neurons to dynamic natural-scene and random-noise film images in awake, freely viewing ferrets from the time of eye opening to maturity. The correspondence between evoked neural activity and the structure of the input signal was weak in young animals, but systematically improved with age. This improvement was linked to a shift in the dynamics of spontaneous activity. At all ages including the mature animal, correlations in spontaneous neural firing were only slightly modified by visual stimulation, irrespective of the sensory input. These results suggest that in both the developing and mature visual cortex, sensory evoked neural activity represents the modulation and triggering of ongoing circuit dynamics by input signals, rather than directly reflecting the structure of the input signal itself.

2003

  • "Contrast Conservation in Human Vision", József Fiser, Peter J. Bex, Walter Makous, Vision Res. 2003 Nov; 43: 2637-2648.
    Abstract / .PDF

    Abstract
    Visual experience, which is defined by brief saccadic sampling of complex scenes at high contrast, has typically been studied with static gratings at threshold contrast. To investigate how suprathreshold visual processing is related to threshold vision, we tested the temporal integration of contrast in the presence of large, sudden changes in the stimuli such as occur during saccades under natural conditions. We observed completely different effects under threshold and suprathreshold viewing conditions. The threshold contrast of successively presented gratings that were either perpendicularly oriented or of inverted phase showed probability summation, implying no detectable interaction between independent visual detectors. However, at suprathreshold levels we found complete algebraic summation of contrast for stimuli longer than 53 ms. The same results were obtained during sudden changes between random noise patterns and between natural scenes. These results cannot be explained by traditional contrast gain-control mechanisms or the effect of contrast constancy. Rather, at suprathreshold levels, the visual system seems to conserve the contrast information from recently viewed images, perhaps for the efficient assessment of the contrast of the visual scene while the eye saccades from place to place.

  • "Coding of Natural Scenes in Primary Visual Cortex", Michael Weliky, József Fiser, Ruskin H. Hunt, David N. Wagner, Neuron. 2003 Feb 20; 37: 703-718.
    Abstract / .PDF

    Abstract
    Natural scene coding in ferret visual cortex was investigated using a new technique for multi-site recording of neuronal activity from the cortical surface. Surface recordings accurately reflected radially aligned layer 2/3 activity. At individual sites, evoked activity to natural scenes was weakly correlated with the local image contrast structure falling within the cells' classical receptive field. However, a population code, derived from activity integrated across cortical sites having retinotopically overlapping receptive fields, correlated strongly with the local image contrast structure. Cell responses demonstrated high lifetime sparseness, population sparseness, and high dispersal values, implying efficient neural coding in terms of information processing. These results indicate that while cells at an individual cortical site do not provide a reliable estimate of the local contrast structure in natural scenes, cell activity integrated across distributed cortical sites is closely related to this structure in the form of a sparse and dispersed code.

2002

  • "Statistical Learning of New Visual Feature Combinations by Infants", József Fiser and Richard N. Aslin, Proc Natl Acad Sci U S A 2002 Nov 26; 99:15822-15826.
    Abstract / .PDF / Commentary

    Abstract
    The ability of humans to recognize a nearly unlimited number of unique visual objects must be based on a robust and efficient learning mechanism that extracts complex visual features from the environment. To determine whether statistically optimal representations of scenes are formed during early development, we used a habituation paradigm with 9-month-old infants and found that, by mere observation of multielement scenes, they become sensitive to the underlying statistical structure of those scenes. After exposure to a large number of scenes, infants paid more attention not only to element pairs that cooccurred more often as embedded elements in the scenes than other pairs, but also to pairs that had higher predictability (conditional probability) between the elements of the pair. These findings suggest that, similar to lower-level visual representations, infants learn higher-order visual features based on the statistical coherence of elements within the scenes, thereby allowing them to develop an efficient representation for further associative learning.

  • "Statistical Learning of Higher-Order Temporal Structure From Visual Shape Sequences", József Fiser and Richard N. Aslin, J Exp Psychol Learn Mem Cogn. 2002 May; 28: 458-467.
    Abstract / .PDF

    Abstract
    In 3 experiments, the authors investigated the ability of observers to extract the probabilities of successive shape co-occurrences during passive viewing. Participants became sensitive to several temporal-order statistics, both rapidly and with no overt task or explicit instructions. Sequences of shapes presented during familiarization were distinguished from novel sequences of familiar shapes, as well as from shape sequences that were seen during familiarization but less frequently than other shape sequences, demonstrating at least the extraction of joint probabilities of 2 consecutive shapes. When joint probabilities did not differ, another higher-order statistic (conditional probability) was automatically computed, thereby allowing participants to predict the temporal order of shapes. Results of a single-shape test documented that lower-order statistics were retained during the extraction of higher-order statistics. These results suggest that observers automatically extract multiple statistics of temporal events that are suitable for efficient associative learning of new temporal features.

  • "Eye-grabbing Insights", Bruce Bower, Science News, Nov. 9, 2002; 162, 19: 293.
    .PDF

2001

  • "To What Extent Can Matching Algorithms Based on Direct Outputs of Low Level Generic Descriptors Account for Human Object Recognition?", József Fiser and Eric E. Cooper, 2001 March 2; 1-35.
    Abstract / .PDF

    Abstract
    A number of recent successful models of face recognition posit only two layers, an input layer consisting of a lattice of spatial filters and a single subsequent stage by which those descriptor values are mapped directly onto an object representation layer by standard matching methods such as stochastic optimization. Is this approach sufficient for modeling human object recognition? We tested whether a highly efficient version of such a two-layer model would manifest effects similar to those shown by humans when given the task of recognizing images of objects that had been employed in a series of psychophysical experiments. System accuracy was quite high overall, but was qualitatively different from that evidenced by humans in object recognition tasks. The discrepancy between the system's performance and human performance is likely to be revealed by all models that map filter values directly onto object units. These results suggest that human object recognition (as opposed to face recognition) may be difficult to approximate by models that do not posit hidden units for explicit representation of intermediate entities such as edges, viewpoint invariant classifiers, axes, shocks and/or object parts.

  • "Unsupervised Statistical Learning of Higher-Order Spatial Structures from Visual Scenes", József Fiser and Richard N. Aslin, Psychological Science, November 2001; 12: 499-504.
    Abstract / .PDF

    Abstract
    Three experiments investigated the ability of human observers to extract the joint and conditional probabilities of shape co-occurrences during passive viewing of complex visual scenes. Results indicated that statistical learning of shape conjunctions was both rapid and automatic, as subjects were not instructed to attend to any particular features of the displays. Moreover, in addition to single-shape frequency, subjects acquired in parallel several different higher-order aspects of the statistical structure of the displays, including absolute shape-position relations in an array, shape-pair arrangements independent of position, and conditional probabilities of shape co-occurrences. Unsupervised learning of these higher-order statistics provides support for Barlow's theory of visual recognition, which posits that detecting "suspicious coincidences" of elements during recognition is a necessary prerequisite for efficient learning of new visual features.

  • "Size Tuning in the Absence of Spatial Frequency Tuning in Object Recognition", József Fiser, Suresh Subramaniam, Irving Biederman, Vision Research, 2001; 41, 1931-1950.
    Abstract / .PDF

    Abstract
    How do we attend to objects at a variety of sizes as we view our visual world? Because of an advantage in identification of lowpass over highpass filtered patterns, as well as large over small images, a number of theorists have assumed that size-independent recognition is achieved by spatial frequency (SF) based coarse-to-fine tuning. We found that the advantage of large sizes or low SFs was lost when participants attempted to identify a target object (specified verbally) somewhere in the middle of a sequence of 40 images of objects, each shown for only 72 ms, as long as the target and distractors were the same size or spatial frequency (unfiltered or low or high bandpassed). When targets were of a different size or scale than the distractors, a marked advantage (pop out) was observed for large (unfiltered) and low SF targets against small (unfiltered) and high SF distractors, respectively, and a marked decrement for the complementary conditions. Importantly, this pattern of results for large and small images was unaffected by holding absolute or relative SF content constant over the different sizes and it could not be explained by simple luminance- or contrast-based pattern masking. These results suggest that size/scale tuning in object recognition was accomplished over the first several images (576 ms) in the sequence and that the size tuning was implemented by a mechanism sensitive to spatial extent rather than to variations in spatial frequency.

  • "Invariance of Long-term Visual Priming to Scale, Reflection, Translation, and Hemisphere", József Fiser and Irving Biederman, Vision Research, 2001; 41, 221-234.
    Abstract / .PDF

    Abstract
    The representation of shape mediating visual object priming was investigated. In two blocks of trials, subjects named images of common objects presented for 185 ms that were bandpass filtered, either at high (10 cpd) or at low (2 cpd) center frequency with a 1.5 octave bandwidth, and positioned either 5° right or left of fixation. The second presentation of an image of a given object type could be filtered at the same or different band, be shown at the same or translated (and mirror reflected) position, and be the same exemplar as that in the first block or a same-name different-shaped exemplar (e.g. a different kind of chair). Second block reaction times (RTs) and error rates were markedly lower than they were on the first block, which, in the context of prior results, was indicative of strong priming. A change of exemplar in the second block resulted in a significant cost in RTs and error rates, indicating that a portion of the priming was visual and not just verbal or basic-level conceptual. However, a change in the spatial frequency (SF) content of the image had no effect on priming despite the dramatic difference it made in appearance of the objects. This invariance to SF changes was also preserved with centrally presented images in a second experiment. Priming was also invariant to a change in left–right position (and mirror orientation) of the image. The invariance over translation of such a large magnitude suggests that the locus of the representation mediating the priming is beyond an area that would be homologous to posterior TEO in the monkey. We conclude that this representation is insensitive to low level image variations (e.g. SF, precise position or orientation of features) that do not alter the basic part-structure of the object. 
    Finally, recognition performance was unaffected by whether low or high bandpassed images were presented either in the left or right visual field, giving no support to the hypothesis of hemispheric differences in processing low and high spatial frequencies.

  • "Size Invariance in Visual Object Priming of Gray Scale Images", József Fiser and Irving Biederman, Perception, March 2, 2001.
    Abstract / .PDF

    Abstract

    The strength of visual priming of briefly presented gray scale pictures of real world objects, measured by naming reaction times and errors, was independent of whether the primed picture of the object was presented in the same or different size than the original picture. These findings replicate Biederman & Cooper’s (1992) results on size invariance in shape recognition, which were obtained with line drawings, and extend them to the domain of gray level images. Entry-level shape identification is based either predominantly on scale-invariant representations incorporating orientation and depth discontinuities which are well captured by line drawings, or both discontinuities and the representation derived from smooth gradual surface changes are scale invariant.

2000
  • "Experience-dependent Visual Cue Integration Based on Consistencies Between Visual & Haptic Percepts", Joseph E. Atkins, József Fiser and Robert A. Jacobs, Vision Research, 2001; 41, 449-461.
    Abstract / .PDF

    Abstract

    We study the hypothesis that observers can use haptic percepts as a standard against which the relative reliabilities of visual cues can be judged, and that these reliabilities determine how observers combine depth information provided by these cues. Using a novel visuo-haptic virtual reality environment, subjects viewed and grasped virtual objects. In Experiment 1, subjects were trained under motion relevant conditions, during which haptic and visual motion cues were consistent whereas haptic and visual texture cues were uncorrelated, and texture relevant conditions, during which haptic and texture cues were consistent whereas haptic and motion cues were uncorrelated. Subjects relied more on the motion cue after motion relevant training than after texture relevant training, and more on the texture cue after texture relevant training than after motion relevant training. Experiment 2 studied whether or not subjects could adapt their visual cue combination strategies in a context-dependent manner based on context-dependent consistencies between haptic and visual cues. Subjects successfully learned two cue combination strategies in parallel, and correctly applied each strategy in its appropriate context. Experiment 3, which was similar to Experiment 1 except that it used a more naturalistic experimental task, yielded the same pattern of results as Experiment 1 indicating that the findings do not depend on the precise nature of the experimental task. Overall, the results suggest that observers can involuntarily compare visual and haptic percepts in order to evaluate the relative reliabilities of visual cues, and that these reliabilities determine how cues are combined during three-dimensional visual perception.

  • "Minimizing Binding Errors Using Learned Conjunctive Features", Bartlett W. Mel and József Fiser, Neural Computation, 12, 247-278.
    Abstract / .PDF

    Abstract
    We have studied some of the design trade-offs governing visual representations based on spatially invariant conjunctive feature detectors, with an emphasis on the susceptibility of such systems to false-positive recognition errors: Malsburg’s classical binding problem. We begin by deriving an analytical model that makes explicit how recognition performance is affected by the number of objects that must be distinguished, the number of features included in the representation, the complexity of individual objects, and the clutter load, that is, the amount of visual material in the field of view in which multiple objects must be simultaneously recognized, independent of pose, and without explicit segmentation. Using the domain of text to model object recognition in cluttered scenes, we show that with corrections for the nonuniform probability and nonindependence of text features, the analytical model achieves good fits to measured recognition rates in simulations involving a wide range of clutter loads, word sizes, and feature counts. We then introduce a greedy algorithm for feature learning, derived from the analytical model, which grows a representation by choosing those conjunctive features that are most likely to distinguish objects from the cluttered backgrounds in which they are embedded. We show that the representations produced by this algorithm are compact, decorrelated, and heavily weighted toward features of low conjunctive order. Our results provide a more quantitative basis for understanding when spatially invariant conjunctive features can support unambiguous perception in multiobject scenes, and lead to several insights regarding the properties of visual representations optimized for specific recognition tasks.

1999

  • "Subordinate-level Object Classification Reexamined", Irving Biederman, Suresh Subramaniam, Moshe Bar, Peter Kalocsai, József Fiser, Psychological Research, 1999; 62: 131-153.
    Abstract / .PDF

    Abstract

    The classification of a table as round rather than square, a car as a Mazda rather than a Ford, a drill bit as 3/8-inch rather than 1/4-inch, and a face as Tom have all been regarded as a single process termed "subordinate classification." Despite the common label, the considerable heterogeneity of the perceptual processing required to achieve such classifications requires, minimally, a more detailed taxonomy. Perceptual information relevant to subordinate-level shape classifications can be presumed to vary on continua of (a) the type of distinctive information that is present, nonaccidental or metric, (b) the size of the relevant contours or surfaces, and (c) the similarity of the to-be-discriminated features, such as whether a straight contour has to be distinguished from a contour of low curvature versus high curvature. We consider three, relatively pure cases. Case 1 subordinates may be distinguished by a representation, a geon structural description (GSD), specifying a nonaccidental characterization of an object's large parts and the relations among these parts, such as a round table versus a square table. Case 2 subordinates are also distinguished by GSDs, except that the distinctive GSDs are present at a small scale in a complex object so the location and mapping of the GSDs are contingent on an initial basic-level classification, such as when we use a logo to distinguish various makes of cars. Expertise for Cases 1 and 2 can be easily achieved through specification, often verbal, of the GSDs. Case 3 subordinates, which have furnished much of the grist for theorizing with "view-based" template models, require fine metric discriminations. Cases 1 and 2 account for the overwhelming majority of shape-based basic- and subordinate-level object classifications that people can and do make in their everyday lives. These classifications are typically made quickly, accurately, and with only modest costs of viewpoint changes. Whereas the activation of an array of multiscale, multiorientation filters, presumed to be at the initial stage of all shape processing, may suffice for determining the similarity of the representations mediating recognition among Case 3 subordinate stimuli (and faces), Cases 1 and 2 require that the output of these filters be mapped to classifiers that make explicit the nonaccidental properties, parts, and relations specified by the GSDs.

1998
  • "Distance Modulation of Neural Activity in the Visual Cortex", Allan C. Dobbins, Richard M. Jeo, József Fiser, John M. Allman, Science, 24 July 1998; 281 (5376):552-555.
    Abstract / .PDF / Commentary

    Abstract

    Humans use distance information to scale the size of objects. Earlier studies demonstrated changes in neural response as a function of gaze direction and gaze distance in the dorsal visual cortical pathway to parietal cortex. These findings have been interpreted as evidence of the parietal pathway's role in spatial representation. Here, distance-dependent changes in neural response were also found to be common in neurons in the ventral pathway leading to inferotemporal cortex of monkeys. This result implies that the information necessary for object and spatial scaling is common to all visual cortical areas.
