Automatic recognition of a speaker’s emotions is a natural objective for research, but it is difficult to gauge the level of performance that is currently attainable. We describe a study that offers a rough benchmark. Speech data came from five passages of about 100 syllables each, selected following pilot studies because they were effective at evoking specific emotions: fear, anger, happiness, sadness, and neutrality. Forty subjects were recorded reading them. A battery of 32 potentially relevant features was extracted using our ASSESS system. They were, broadly speaking, prosodic, derived from contours tracing the movement of intensity and pitch. These features were input to statistical decision mechanisms of two types. Discriminant analysis uses linear combinations of variables to separate samples that belong to different categories. There are reasons to suspect that linear combinations will not be appropriate, so neural net classifiers were also considered. An automatic relevance determination procedure was used to identify the most relevant parameters. Discriminant analysis outperformed the neural networks. Using 90% of the data for training and testing on the remaining 10%, a classification rate of 55% (±0.08%) was achieved. The most useful predictors covered a variety of properties: intensity (relative to the start of the passage) and its spread; pitch spread; durations of silences, rises in intensity, and syllables; and a property related to the shape of ‘tunes’, the number of inflections in the F0 contour per tune. Many more variables were less important, but nevertheless contributed.
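To make the evaluation setup concrete, the sketch below shows the kind of comparison the study describes: a linear discriminant classifier against a small neural network on a feature matrix of prosodic measures, with a 90%/10% train/test split. This is not the authors’ code; the feature values are synthetic stand-ins for the 32 ASSESS measures, and the sample sizes and network shape are assumptions made only for illustration.

```python
# Minimal sketch (assumed setup, not the original study's code):
# compare discriminant analysis with a small neural network on
# prosodic-style features using a 90%/10% train/test split.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
emotions = ["fear", "anger", "happiness", "sadness", "neutrality"]

# Synthetic stand-in data: 40 samples per emotion, 32 features each,
# with class-dependent shifts so the classifiers have structure to learn.
n_per_class, n_features = 40, 32
X = np.vstack([
    rng.normal(loc=i * 0.5, scale=1.0, size=(n_per_class, n_features))
    for i in range(len(emotions))
])
y = np.repeat(emotions, n_per_class)

# 90% of the data for training, testing on the remaining 10%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Discriminant analysis: separates the categories with linear
# combinations of the input variables.
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print("Discriminant analysis accuracy:", lda.score(X_test, y_test))

# Small neural network, for the case where linear combinations
# of the features may not be adequate.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X_train, y_train)
print("Neural net accuracy:", mlp.score(X_test, y_test))
```

With real recordings, the feature matrix would instead hold the ASSESS measures named above (intensity and its spread, pitch spread, silence and syllable durations, and F0 inflection counts per tune), and the relevance-determination step would select among them before classification.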