Auditory Scene Analysis: The Perceptual Organization of Sound by Albert S. Bregman

Reviewed by David Huron

Psychology of Music, Vol. 19, No. 1 (1991) pp. 77-82.
Albert S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, Massachusetts: MIT Press, 1990. 773pp. ISBN 0-262-02297-4 (hard cover).

This massive tome is the culmination of more than two decades of research by one of the leading figures in auditory perception -- Albert Bregman. Over the years, a constant series of papers has issued forth from Bregman's Montreal lab -- nearly all dealing with the formation of auditory images. In this book, Bregman has brought it all together and given us a lucid and masterful theory of how listeners make sense of the world of sound.

Bregman begins by asking the question, what is the purpose of perception? He suggests that our perceptual faculties evolved as a means of allowing us to construct a useful representation of reality. Perception is functional and ecological -- providing us with the what, when, and where of the events around us. The primary task of the auditory system is to arrange the cacophony of frequency wisps into meaningful clumps that correspond to various real-world activities. In short, the act of hearing may be likened to the work of a cartographer constantly drafting maps of the auditory scene.

Some sounds (such as the slamming of a door) mark the occurrence of unique events. But the world of sound is not merely a succession of momentary incidents. Even discrete sounds -- such as a series of footsteps or the dripping of a tap -- are often caused by an on-going coherent activity. Most sounds have a lineage or history. The mental images we form of such "lines of sound" Bregman has dubbed auditory streams, and the study of the behavior of such images is the study of auditory streaming. Since the recognition of events depends upon the proper assignment of auditory properties to different sound sources, auditory streaming is fundamental to the process of scene analysis -- which in turn is fundamental to music perception.

There is a long history of research pertaining to the formation of auditory images. A partial list of contributors would include Hermann von Helmholtz, Carl Stumpf, Otto Ortmann, George Miller, George Heise, Donald Norman, Jay Dowling, Diana Deutsch, David Wessel and Stephen McAdams. In the realm of music, the work of Leo van Noorden (1975) is especially outstanding. However, none of the above researchers has pursued the topic with such sustained conviction and in such detail as Al Bregman.

Auditory streaming entails two complementary domains of study. How sounds cohere to form a sense of continuation is the subject of stream fusion. Since more than one source can sound concurrently, a second domain of study is how concurrent activities retain their independent identities -- the subject of stream segregation. In general, individual sounds tend to coalesce into a single percept in proportion to the physical correlations shared by the parts. Stream-determining factors include: timbre (spectral shape), fundamental frequency (pitch) proximity, temporal proximity, harmonicity, intensity, and spatial origin. In addition, when sounds evolve with respect to time, it is possible for them to share similarities by virtue of evolving in the same way. In Gestalt psychology, this perceptual co-evolution of parts is known as the principle of common fate. Bregman has pointed out that the formation of an auditory stream is governed largely by this principle.
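To make the role of pitch proximity concrete, here is a toy sketch in Python. It is not Bregman's model and does not come from the book: the function name assign_streams, the greedy rule, and the 5-semitone threshold are all assumptions chosen for illustration, and the sketch ignores tempo, timbre, and every other factor listed above. Pitches are given as MIDI note numbers.

```python
def assign_streams(pitches, max_jump=5.0):
    """Toy grouping by pitch proximity alone (hypothetical rule).

    Each tone joins the stream whose most recent tone is nearest in pitch,
    provided the jump is at most `max_jump` semitones; otherwise the tone
    starts a new stream. The threshold is an arbitrary assumption.
    """
    streams = []  # each stream is a list of pitches in order of arrival
    for p in pitches:
        best = None
        for s in streams:
            jump = abs(p - s[-1])
            if jump <= max_jump and (best is None or jump < abs(p - best[-1])):
                best = s
        if best is None:
            streams.append([p])   # no stream is close enough: segregate
        else:
            best.append(p)        # close enough: fuse with that stream
    return streams


# An alternation a whole tone apart stays in one stream;
# an alternation an octave apart splits into two.
narrow = assign_streams([60, 62, 60, 62, 60, 62])
wide = assign_streams([60, 72, 60, 72, 60, 72])
print(len(narrow), len(wide))
```

Under these assumptions the narrow alternation is held together as a single line, while the wide alternation is heard out as two separate lines, which is the qualitative pattern the paragraph describes.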

Bregman's functional/ecological theory of streaming is both sensible and satisfying. However, one difficulty with this view is that spatial location is a relatively weak factor contributing to auditory streaming. One would have thought that location would provide the strongest cue in the construction of an ecological representation since one of the best generalizations that can be made about independent sound sources is that they normally occupy distinct positions in space. Bregman suggests that due to reverberation and the transparency of sound, localization cues are comparatively unreliable (p.83). Although reverberation can indeed confound localization, the arguments here are not especially compelling. There is no mention of the Haas or precedence effect, nor any citation of the literature demonstrating surprisingly good monaural localization abilities. If human localization abilities are truly poor, a functional/ecological theory needs to explain why a better system has not evolved, given that the pre-eminent function of the auditory system is supposedly the parsing of streams required for scene analysis. The relative unimportance of localization in stream formation suggests that the ecological account may be incomplete.

An important distinction Bregman makes is between primitive segregation and schema-based segregation. Primitive segregation is a bottom-up process whereby streams are parsed according to the correlations of acoustical cues. By contrast, schema-based segregation is a top-down process that arises from experiential and cognitive factors. Schema-based streaming is characterized by voluntary or effortful listening -- an active "hearing-out" for a given pattern. Bregman postulates several differences by which primitive streaming can be distinguished from schema-based streaming. He suggests that in primitive streaming all frequencies will be assigned to one or another stream with no "unstreamed" residual components. In schema-based streaming, fusion of the foreground elements does not automatically result in the collective fusion of background elements. In other words, schema-based streaming may leave isolated "embellishment tones" that do not themselves cohere. Primitive and schema-based streaming differ also with regard to tempo. In primitive streaming, increasing the tempo of presentation always enhances the within-stream integration and between-stream segregation. However, in schema-based streaming, Bregman suggests that beyond a certain tempo, increasing the speed of presentation may tend to worsen the perceptual integration of the target stream since recognition of a familiar or predicted pattern may be lost.

One factor that Bregman claims does not contribute to schema-based segregation is pitch trajectory. The evidence against the auditory system extrapolating an existing pitch trajectory is both impressive and (initially) counter-intuitive. This phenomenon contrasts with the situation in vision where extrapolated motion is fundamental to the maintenance of visual images. Bregman suggests that there is some merit in the visual system expecting objects to behave in accordance with Newton's first law regarding momentum. But sound sources have no reason to act this way: "There is no inertia in a vocal tract that makes it probable that a rising pitch will continue to rise. In fact, the opposite could be true." (p.442) Indeed, the tendency is to stream tones within a stable tessitura. Bregman neatly summarizes these results in the motto "interpolation not extrapolation".

In the chapter concerning auditory organization in music, Bregman suggests that music may be regarded as a sort of "auditory fiction". Contrasted with other listening experiences, musical streams do not necessarily correspond with real sources in the world. Of course individual instruments such as trumpets and violins are truly real sources, but musicians like to combine such sources to form supra-source objects (such as multi-instrument "voices"). Stephen McAdams has called these virtual sources; Pierre Boulez has called them phantasmagoric instruments; Bregman proposes the term chimeric percepts:

"The Chimaera was a beast in Greek mythology with the head of a lion, the body of a goat, and the tail of a serpent. We use the word chimera metaphorically to refer to an image derived as a composition of other images. An example of an auditory chimera would be a heard sentence that was created by the accidental composition of the voices of two persons who just happened to be speaking at the same time. Natural hearing tries to avoid chimeric percepts, but music often tries to create them. It may want the listener to accept the simultaneous roll of the drum, clash of the cymbal, and brief pulse of noise from the woodwinds as a single coherent event with its own striking emergent properties. The sound is chimeric in the sense that it does not belong to any single environmental object." (pp.459-460)

In short, music listening may be profitably regarded as a type of scene analysis problem, with the exception that the auditory scenes are populated by a cast of mostly fictional sources.

There is plenty of evidence to suggest that a melody is a species of auditory stream. There is similarly plenty of evidence indicating that polyphonic music-making accords with the principles of auditory streaming. Bregman does not directly address the perception of homophonic textures, although he does suggest that music may be conceived in terms of hierarchies of streams. Partials may cohere into tones while tones may constitute chimeric entities normally called chords. Since chords may presumably be perceived as single entities, chord sequences might be able to form a single stream. There are several unexplored repercussions to this view. Normally the formation of a stream is signalled by (1) the opacity of its constituent parts, and (2) the concurrent appearance of the emergent properties of the new whole. But how can a stream be hierarchically constituted of subordinate streams if its parts are supposed to be opaque? A good response might be that a stream can be regarded as an object of attention (at whatever level: partial, tone, chord, etc.). In this case, when perceiving a chord, the amalgamated partials of a constituent chordal tone don't really form a stream per se, but rather form a potential stream that is realized only with a shift of attention (from chord to chordal tone). Since Bregman proposes that primitive streaming is pre-attentive, the implication is that hierarchical stream organization is necessarily schema-based (or at least attention-driven).

One of the most musically innovative ideas in the book is the theory of dissonance developed in conjunction with James Wright. Wright and Bregman suggest that when two concurrent tones are captured by independent streams, their potential dissonance is suppressed or neutralized. Thus the degree to which a major seventh interval is perceived as dissonant depends upon how well the constituent tones are integrated into their respective horizontal voices.

"I think that if we were to hear, over headphones, a violin partita by Bach in one ear and piano sonata by Beethoven in the other, and if these were well segregated perceptually, a combination consisting of one note from the Bach and one from the Beethoven would be neither consonant nor dissonant." (p.521).

The theory is illustrated in Figure 1 (from page 513). Due to the close within-voice pitch proximity, the two diatonic scales in Figure 1a segregate well from each other. At the same time there is little or no perceived dissonance in Figure 1a. However, if the fourth through seventh intervals are extracted and rearranged so as to reduce the pitch proximity (and so reduce the horizontal streaming), the dissonances become evident (Figure 1b).

Figure 1a.


Figure 1b.

From this principle a full-fledged theory of non-chordal notes is developed. Wright and Bregman propose that the potential dissonance arising from non-chordal notes is controlled by ensuring good streaming. In practice, this means that most non-chordal notes will maintain close within-voice pitch proximity (i.e. antecedent and consequent step motion), and will be given asynchronous onsets. Passing notes, neighbor tones, suspensions, and anticipations all conform to these stringent streaming conditions -- whereas appoggiaturas and escape tones conform less well to the pitch proximity constraints. The most common types of non-chordal tones appear to be those that most contribute to the within-voice stream fusion.

Wright (1986) has suggested that the increasing dissonance over the course of the history of western music is reflected in the manner by which dissonant intervals are prepared. Over time, dissonances have been heightened by the dual practices of increasing the onset/offset synchronization of the tones forming the dissonant interval, and by decreasing the antecedent and consequent step motion. In short, the historical increase in musical dissonance is less attributable to the increasing prevalence of dissonant vertical moments, and more attributable to the weakening of horizontal streaming.

Interestingly, the Wright/Bregman theory is diametrically opposed to the theory of consonance proposed by Carl Stumpf (1898). Stumpf argued that the degree of perceived consonance in intervals is proportional to their tendency to fuse into a single percept. Intervals exhibiting simple frequency ratios are especially prone to tonal fusion (Verschmelzung) -- and hence are perceived as being most consonant. Stumpf (1926) later retracted this explanation. With the classic paper by Plomp and Levelt (1965), Helmholtz's theory of consonance and dissonance arising from the aggregate beating of adjacent partials was vindicated -- with an important modification arising from the influence of critical bands. The Wright/Bregman theory does not contradict the work of Plomp and Levelt.
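The Plomp and Levelt account can be sketched numerically. The Python fragment below is an illustration only: it uses a curve fit to the Plomp and Levelt data that is commonly attributed to Sethares, and the constants, the 1/k amplitude roll-off, the choice of six partials, and the function names are all assumptions that appear neither in the review nor in Bregman's book. It says nothing about streaming; it estimates only the vertical, beating-based component of dissonance.

```python
import math


def pair_roughness(f1, f2, a1=1.0, a2=1.0):
    """Approximate roughness of two pure tones (frequencies in Hz).

    The constants are an assumed curve fit to the Plomp-Levelt data.
    The scaling term `s` stands in for the critical bandwidth, which
    widens as the lower frequency rises.
    """
    f_low = min(f1, f2)
    df = abs(f2 - f1)
    s = 0.24 / (0.0207 * f_low + 18.96)
    return a1 * a2 * (math.exp(-3.51 * s * df) - math.exp(-5.75 * s * df))


def interval_roughness(f0, ratio, n_partials=6):
    """Sum pairwise roughness over all partials of two harmonic tones.

    Each tone has `n_partials` harmonics with amplitude 1/k (an assumption).
    """
    partials = [(f0 * k, 1.0 / k) for k in range(1, n_partials + 1)]
    partials += [(f0 * ratio * k, 1.0 / k) for k in range(1, n_partials + 1)]
    total = 0.0
    for i in range(len(partials)):
        for j in range(i + 1, len(partials)):
            (fa, aa), (fb, ab) = partials[i], partials[j]
            total += pair_roughness(fa, fb, aa, ab)
    return total


# Compare a major seventh (15:8) with a perfect fifth (3:2) above middle C.
seventh = interval_roughness(262.0, 15 / 8)
fifth = interval_roughness(262.0, 3 / 2)
print(seventh > fifth)
```

With these assumed parameters the major seventh comes out rougher than the perfect fifth, because more of its partials fall close together within a critical band.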

Nevertheless, a number of issues arise from the Wright/Bregman theory. First, consider the notational examples in Figure 2. There is little difficulty hearing the two tones constituting the major seventh interval in Figure 2a. Adding the "e" and "g" in Figure 2b has two effects: the individual notes are no longer easily resolved (the sonority sounds more like a single chord than four separate notes) and, more importantly, the dissonance of the major seventh interval is considerably softened. This example poses a problem for the Wright/Bregman theory since a decrease in dissonance appears to accompany a decrease in the segregation of the auditory images.

Figure 2.

Second, Wright and Bregman argue that the existence of good horizontal streaming permits the addition of non-chordal tones without suffering the penalty of undue dissonance. But the cause and effect may be reversed here. It may be that the goal of good horizontal streaming leads composers to add non-chordal tones in order to enhance the voice segregation. In short, the purpose of non-chordal tones may not be to add the spice of dissonance (without being too spicy); an equally plausible explanation may be that non-chordal tones are used to enhance the horizontal fusion of individual voices. The fact that even monophonic melodies make use of "non-chordal" tones (such as passing tones) lends credence to the idea that part of their purpose is to enhance horizontal streaming rather than to add dissonance. The Wright/Bregman theory of dissonance is nevertheless an important theory, and certainly worthy of experimental investigation.

Without doubt, this volume is destined to be a classic treatise in hearing sciences. However, unlike much of the hearing sciences literature, Auditory Scene Analysis deals with the type of higher level issues that begin to intersect significantly with truly musical concerns. The book's sheer length and technical detail are apt to intimidate many musician readers, but I cannot recommend this book too highly for serious scholars of music perception. Auditory Scene Analysis is a first-rate reference work that provides an exhaustive account of the state of research concerning the formation of auditory images. The book also articulates a unique and visionary theoretical framework that is bound to inspire a great deal of further research.

David Huron
Ohio State University


References

Plomp, R. and Levelt, W.J.M. (1965). Tonal consonance and critical bandwidth. Journal of the Acoustical Society of America, Vol. 38, pp. 548-560.

Stumpf, C. (1898). Konsonanz und Dissonanz. Beiträge zur Akustischen Musikwissenschaft, Vol. 1, pp. 1-108.

Stumpf, C. (1926). Die Sprachlaute. Berlin: Verlag J. Springer.

Van Noorden, L.P.A.S. (1975). Temporal Coherence in the Perception of Tone Sequences. Eindhoven University of Technology, doctoral dissertation.

Wright, J.K. (1986). Auditory Object Perception: Counterpoint in a New Context. McGill University, Master's thesis.

