High-frequency spectral content and the perceived buzziness of sung tones

David M. Howard, Sten Ternström

Model-based singing synthesisers tend to produce vowels that sound ‘buzzy’. In this investigation, we attempt to identify some of the acoustic features that cue the perception of ‘buzziness’. Twenty-five non-naïve listeners with self-reported normal hearing ranked the buzziness of synthesised and live stimulus tones, each representing a sung vowel of two seconds' duration. No instructions were given as to what ‘buzziness’ means. In a pilot experiment, the stimulus tones were varied systematically with regard to high-frequency (HF) content, F0 regularity, and synthetic versus live origin. The results showed that HF content was the strongest correlate of buzziness, and that adding a sinusoidal vibrato or random flutter reduced the buzziness somewhat. The live tones were ranked as somewhat less buzzy than the synthesised ones. The results were very similar for listeners in York and in Stockholm. These findings prompted a second experiment, using fewer factors and smaller changes in HF content, which confirmed that listeners can perceive even small, low-level spectrum changes above 5 kHz. The findings suggest that the highest octave of the synthesised spectrum needs attention if model-based singing synthesis is to achieve high fidelity.

David M. Howard
Dept of Electronics
University of York, UK
dh@ohm.york.ac.uk

Sten Ternström
Dept of Speech, Music and Hearing,
Kungliga Tekniska Högskolan,
Stockholm, Sweden
stern@kth.se