Notes on selected articles by Klaus R. Scherer (and collaborators) on Vocal Affect Expression
Scherer and Oshinsky (1977). Synthesized tone sequences, whose major acoustic parameters had been systematically manipulated, were rated on scales of pleasantness, activity, and potency as well as on various emotion scales. The table below (reproduced from Scherer and Oshinsky, 1977, p. 340) summarizes the results.
Acoustic Parameters of Tone Sequences Significantly Contributing to the Variance of Attributions of Emotional States
| Rating Scale | Single acoustic parameters (main effects) and configurations (interaction effects), listed in order of predictive strength |
| --- | --- |
| Pleasantness | Fast tempo, few harmonics, large pitch variation, sharp envelope, low pitch level, pitch contour down, small amplitude variation (salient configuration: large pitch variation plus pitch contour up) |
| Activity | Fast tempo, high pitch level, many harmonics, large pitch variation, sharp envelope, small amplitude variation |
| Potency | Many harmonics, fast tempo, high pitch level, round envelope, pitch contour up (salient configurations: large amplitude variation plus high pitch level, high pitch level plus many harmonics) |
| Anger | Many harmonics, fast tempo, high pitch level, small pitch variation, pitch contour up (salient configuration: small pitch variation plus pitch contour up) |
| Boredom | Slow tempo, low pitch level, few harmonics, pitch contour down, round envelope, small pitch variation |
| Disgust | Many harmonics, small pitch variation, round envelope, slow tempo (salient configuration: small pitch variation plus pitch contour up) |
| Fear | Pitch contour up, fast tempo, many harmonics, high pitch level, round envelope, small pitch variation (salient configurations: small pitch variation plus pitch contour up, fast tempo plus many harmonics) |
| Happiness | Fast tempo, large pitch variation, sharp envelope, few harmonics, moderate amplitude variation (salient configurations: large pitch variation plus pitch contour up, fast tempo plus few harmonics) |
| Sadness | Slow tempo, low pitch level, few harmonics, round envelope, pitch contour down (salient configuration: low pitch level plus slow tempo) |
| Surprise | Fast tempo, high pitch level, pitch contour up, sharp envelope, many harmonics, large pitch variation (salient configuration: high pitch level plus fast tempo) |
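The manipulated dimensions map naturally onto a simple additive-synthesis sketch. The following is my own minimal illustration, not Scherer and Oshinsky's actual stimulus-generation procedure; the function name and parameter values are invented:

```python
import math

def synth_tone_sequence(freqs_hz, tempo_bps=4.0, n_harmonics=3,
                        sharp_envelope=True, sample_rate=8000):
    """Render a tone sequence as raw audio samples.

    freqs_hz       -- pitch of each tone (pitch level / variation / contour)
    tempo_bps      -- tones per second (tempo)
    n_harmonics    -- number of partials (few vs. many harmonics)
    sharp_envelope -- quick attack/decay vs. a rounder envelope
    """
    samples = []
    tone_len = int(sample_rate / tempo_bps)
    for f0 in freqs_hz:
        for i in range(tone_len):
            t = i / sample_rate
            # Additive synthesis: sum the first n_harmonics partials,
            # with amplitude falling off as 1/harmonic number.
            s = sum(math.sin(2 * math.pi * f0 * (h + 1) * t) / (h + 1)
                    for h in range(n_harmonics))
            # Envelope: sharp = fast linear decay; round = slow sine arch.
            frac = i / tone_len
            env = max(0.0, 1.0 - 2 * frac) if sharp_envelope \
                else math.sin(math.pi * frac)
            samples.append(s * env)
    return samples

# A rising contour with large pitch variation at a fast tempo --
# a configuration the table associates with happiness and surprise.
seq = synth_tone_sequence([220, 330, 440, 660], tempo_bps=6.0)
```

Each of the table's dimensions (tempo, harmonics, pitch level/variation/contour, envelope shape) corresponds to one argument or input list, which is what makes systematic manipulation straightforward.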
Scherer's main interest appears to be speech and not music. This 1977 study used tone sequences as an approximation of speech, perhaps because systematic manipulation of voice samples was too difficult at that time. The remaining studies concern themselves primarily with speech.
Human beings are quite good at determining each other's emotional states. When asked to identify the emotional states of paid actors on the basis of vocal utterances alone (constructed nonsensically from common Indo-European phonemes), experimental participants achieve an average accuracy of about 50% across all emotion categories (although some, like joy, are higher and some, like disgust, are lower). This might not seem very accurate, but it is significantly better than the performance expected by chance (i.e. random guessing). When combined with other channels of emotional communication (such as facial expression and posture), vocal cues are clearly a powerful interpretive tool. (For experimental reports, see the 1991, 1996a, and 2001 papers.)
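The "better than chance" claim is easy to make concrete: in a forced-choice task the chance baseline is one over the number of response alternatives, and a binomial tail probability shows how unlikely the observed accuracy is under pure guessing. A quick sketch (the category count of 10 is illustrative; the number of alternatives differs across the studies):

```python
import math

def chance_accuracy(n_categories):
    """Expected accuracy of uniform random guessing in a forced-choice task."""
    return 1.0 / n_categories

def p_at_least(k_correct, n_trials, p):
    """Binomial tail P(X >= k_correct): the chance of doing at least this
    well by guessing alone, with per-trial success probability p."""
    return sum(math.comb(n_trials, k) * p ** k * (1 - p) ** (n_trials - k)
               for k in range(k_correct, n_trials + 1))

# With e.g. 10 emotion categories, guessing yields 10% accuracy on average;
# the probability of getting 50 of 100 trials right by luck is vanishingly small.
p_lucky = p_at_least(50, 100, chance_accuracy(10))
```

This is why a 50% hit rate, unimpressive at first glance, is strong evidence that listeners extract real information from the voice.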
Emotional communication involves both an encoding and a decoding process. On the encoding side, an expression may arise as a spontaneous reaction (as, for example, when one grunts "uggghh" upon tasting something horrible), as an intentional communication of one's emotional state, or as some combination of both. The fact that emotional expressions may arise from various mixtures of spontaneity and intentionality should not be a concern, since in all cases the mechanisms of expression have evolved to communicate adaptive information to other organisms. Thus Scherer et al. (2001) conclude that there is no essential difference between using "naturally occurring" emotions and using actors' portrayals of emotions in studies of this nature. On the decoding side, the fact that we are able to discriminate emotions on the basis of hearing alone suggests that, at least in theory, we should be able to identify specific configurations of acoustical parameters for each emotion. On the other hand, the 50% accuracy rate also suggests that the acoustic configurations are subtle and may overlap considerably, probably due to overlaps in the psychological and physiological processes underlying the emotions themselves.
In the 1986b article, Scherer laments the diverse methodologies and somewhat equivocal results of vocal emotion research up to that time. In order to provide a coherent foundation for future empirical research, he proposes a series of speculative but rigorously logical hypotheses on the basis of the following model of vocal affect expression:
cognitive appraisal --> psychophysiological changes --> changes to the vocal production apparatus --> acoustical changes
Along the way he proposes a new model of emotion, the component process model, which can be seen as an expansion of the appraisal theory first proposed by Magda Arnold in 1960. A key ingredient of this model is a sequence of stimulus evaluation checks, summarized in the table below:
Sequence of stimulus evaluation checks (SECs)
first proposed in Scherer (1986b); shown below as summarized in Scherer (1989).
1. Novelty check. Evaluating whether there is a change in the pattern of external or internal stimulation, particularly whether a novel event has occurred or is to be expected.
2. Intrinsic pleasantness check. Evaluating whether a stimulus event is pleasant, inducing approach tendencies, or unpleasant, inducing avoidance tendencies, based on innate feature detectors or on learned associations.
3. Goal/need significance check. Evaluating whether a stimulus event is relevant to important goals or needs of the organism (relevance subcheck), whether the outcome is consistent with, or discrepant from, the state expected for this point in the goal/plan sequence (expectation subcheck), whether it is conducive or obstructive to reaching the respective goals or satisfying the relevant needs (conduciveness subcheck), and how urgently some kind of behavioural response is required (urgency subcheck).
4. Coping potential check. Evaluating the causation of a stimulus event (causation subcheck) and the coping potential available to the organism, particularly the degree of control over the event or its consequences (control subcheck), the relative power of the organism to change or avoid the outcome through fight or flight (power subcheck), and the potential for adjustment to the final outcome via internal restructuring (adjustment subcheck).
5. Norm/self compatibility check. Evaluating whether the event, particularly an action, conforms to social norms, cultural conventions, or expectations of significant others (external standards subcheck), and whether it is consistent with internalized norms or standards as part of the self-concept or ideal self (internal standards subcheck).
The component process model "proposes specific changes in the various subsystems of the organism which are seen to subserve emotion (physiological responses, motor expression, motivational tendencies, subjective feeling states). Thus, the outcome of each check is seen to affect all the different emotion components in a 'value-added' function. Given that the organism constantly evaluates and reevaluates ongoing stimulation on the basis of these checks, one can expect constant modifications of the state of the various subsystems on the basis of the sequences of changes in the outcomes of the checks" (1989). In other words, emotional states are not static, but in constant flux as the appraisal process moves through its various components; nevertheless, the particular "pathway" through the SECs should leave a tell-tale trace on the final outcome of the organism's physiological state and hence on the acoustic parameters of the vocal utterance.
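The sequential, value-added character of the appraisal process can be caricatured in a few lines of code. This is a toy sketch only: the check names follow the SEC table above, but the outcome values, the subsystem names, and the uniform state-update rule are invented for illustration and are not Scherer's model.

```python
# Toy sketch of sequential stimulus evaluation checks (SECs).
# The checks run in a fixed order; each outcome updates every subsystem
# ("value-added"), so the final state reflects the whole appraisal pathway.

SEC_ORDER = ["novelty", "intrinsic_pleasantness", "goal_significance",
             "coping_potential", "norm_compatibility"]

def appraise(event, checks):
    """Run each check in sequence, accumulating its effect on all subsystems."""
    state = {"physiology": 0.0, "expression": 0.0,
             "motivation": 0.0, "feeling": 0.0}
    pathway = []
    for name in SEC_ORDER:
        outcome = checks[name](event)      # illustrative value in [-1, 1]
        pathway.append((name, outcome))
        for subsystem in state:            # every check outcome touches
            state[subsystem] += outcome    # every emotion component
    return state, pathway

# A hypothetical sudden, unpleasant, goal-obstructive, hard-to-control event.
checks = {
    "novelty":                lambda e: e["sudden"],
    "intrinsic_pleasantness": lambda e: e["pleasant"],
    "goal_significance":      lambda e: e["conducive"],
    "coping_potential":       lambda e: e["control"],
    "norm_compatibility":     lambda e: e["norm_ok"],
}
event = {"sudden": 1.0, "pleasant": -0.8, "conducive": -0.9,
         "control": -0.5, "norm_ok": 0.0}
state, pathway = appraise(event, checks)
```

The point of the sketch is structural: two events that end in the same final state can still differ in their recorded pathways, which is what lets the model predict distinct vocal traces for different appraisal histories.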
The table below presents the predicted acoustical changes for various emotions.
Changes predicted for selected acoustic parameters [by emotion]

(This reproduction is incomplete: the emotion column headings are missing, so only the row labels and the predicted directions of change survive. Symbols: > increase, >> marked increase, < decrease, = no change.)

| Acoustic parameter | Predicted changes (one cell per emotion; headings missing) |
| --- | --- |
| F0: Perturbation | < or =, >, >, >, >, > |
| F0: Range | < or =, >, <, >, >>, <, > |
| F0: Shift regularity | =, <, <, <, > |
| Intensity: Range | < or =, >, <, >, >, > |
By way of example, Scherer hypothesizes that the speech of a person experiencing grief or desperation is characterized by increases in the perturbation, mean, range, variability, contour, and shift regularity of the fundamental frequency; the first formant mean as well as the formant precision should increase, while the second formant mean should decrease and the first formant bandwidth should decrease markedly. The mean intensity should increase, the frequency range and amount of high-frequency energy should increase markedly, and the rate of speech should increase (with a concomitant decrease in transition time, i.e. time lag between utterances).
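Predictions of this kind are naturally encoded as structured profiles that could later be compared against measured parameters. The dictionary below merely restates the grief/desperation prediction from the paragraph above; the parameter names and direction shorthand (+, ++, -, --) are my own, not Scherer's notation.

```python
# Grief/desperation profile restated as data.
# "+" = predicted increase, "++" = marked increase,
# "-" = decrease, "--" = marked decrease (shorthand invented here).
GRIEF_DESPERATION = {
    "F0 perturbation":       "+",
    "F0 mean":               "+",
    "F0 range":              "+",
    "F0 variability":        "+",
    "F0 contour":            "+",
    "F0 shift regularity":   "+",
    "F1 mean":               "+",
    "formant precision":     "+",
    "F2 mean":               "-",
    "F1 bandwidth":          "--",
    "intensity mean":        "+",
    "frequency range":       "++",
    "high-frequency energy": "++",
    "speech rate":           "+",
    "transition time":       "-",
}

def direction(profile, parameter):
    """Look up the predicted direction of change for one acoustic parameter."""
    return profile.get(parameter, "no prediction")
```

Testing the 1986b hypotheses then amounts to checking measured directions of change against such a profile, emotion by emotion.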
The ambitious 1996a study was designed specifically to test the predictions made in (1986b), as summarized in the chart above. Many of these predictions are supported, although some need to be revised in light of contradictory evidence. (No such succinct revision has yet been offered; see Scherer (1996a) for experimental results.) As mentioned above, consistent error patterns are also instructive since they may suggest certain similarities in underlying processes among emotions.
The study of emotional cues in human speech can provide a solid framework for the exploration of emotional cues in music. The various acoustic parameters related to speech can be adapted to music: fundamental frequency relates to pitch and melody, intensity relates to dynamics, the formants relate to timbre, and speech rate / transition time relates to rhythm / duration.
The 2001 paper reports the results of an emotion encoding/decoding experiment carried out simultaneously in Germany, Switzerland, Great Britain, the Netherlands, the United States, Italy, France, Spain, and Indonesia. The data show an overall accuracy of 66% across all emotions and countries, suggesting the existence of similar inference rules from vocal expression across cultures. However, accuracy generally decreased with increasing language dissimilarity from German, suggesting that culture- and language-specific paralinguistic patterns may influence the decoding process. (Portions excerpted from the abstract.)
Some of the predictions (in 1986b) about the underlying physiological mechanisms as determinants of acoustic parameters may have been susceptible to post hoc reasoning, i.e. proposing mechanisms after the fact to match some already-observed acoustic correlates in the literature reviewed. (Scherer says as much himself, but feels that this potential problem cannot be avoided.)
Scherer has deferred the study of suprasegmental factors (e.g. prosodic cues such as intonation, rhythm, and timing) until such time as (segmental) acoustic parameters have been exhaustively studied. Perhaps this bottom-up approach risks missing or downplaying essential features of emotional communication.