Supplemental Materials
The materials below supplement selected publications by Peter Birkholz and coworkers. They contain, for example, audio stimuli that were used in the perception experiments described in the papers.
Stone S, Birkholz P (submitted). Articulation-to-speech using electro-optical stomatography and articulatory synthesis.
In this study we controlled the articulatory speech synthesizer VocalTractLab with articulatory movements captured by electro-optical stomatography, i.e., with contact and distance sensors on an artificial palate that is worn in the mouth. This makes it possible to generate audible speech from silently performed speech movements. The supplemental material here provides the audio files for a range of speech sounds generated with the system.
Stone S, Azgin A, Mänz S, Birkholz P (submitted). Prospects of articulatory text-to-speech synthesis.
In this study we outline our architecture for an articulatory text-to-speech synthesis system based on VocalTractLab. With the audio files in this supplemental material we demonstrate a variety of prosodic modifications that can be applied to the generated utterances using simple rules. Please download the examples here.
Stone S, Schmidt P, Birkholz P (submitted). Prediction of voicing and the f0 contour from electromagnetic articulography data for articulation-to-speech synthesis.
In this study we examined to what extent EMA data allow the prediction of voicing and f0 of the speech signal. Predicted f0 contours were imposed on the natural speech recordings made during the EMA measurements and used in a perception experiment. The stimuli of the experiment can be found here.
Birkholz P, Drechsel S, Stone S (2019). Perceptual optimization of an enhanced geometric vocal fold model for articulatory speech synthesis. Interspeech 2019
In this study we examined which control parameter settings of an enhanced geometric vocal fold model would generate the most natural-sounding male and female voices. The male samples (five German words) with the best parameter settings can be played here. The female samples with the best parameter settings can be played here.
Birkholz P, Gabriel F, Kürbis S, Echternach M (2019). How the peak glottal area affects linear predictive coding-based formant estimates of vowels. The Journal of the Acoustical Society of America, 146(1), pp. 223-232.
The supplemental material here contains the CAD files for 3D-printing the ten vocal tract resonators, the CAD files needed to create the silicone vocal fold model, and the audio files of the synthetic vowels generated with the physical models and the computer simulations.
Birkholz P, Stone S, Kürbis S (2019). Comparison of different methods for the voiced excitation of physical vocal tract models. In: Birkholz P, Stone S (eds.) Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2019 (TUDPress, Dresden)
The supplemental material here contains the CAD files for 3D-printing the eight vocal tract resonators for the eight long German vowels and the audio files of the stimuli generated using the different excitation methods.
Birkholz P, Pape D (submitted). How modeling entrance loss and flow separation in a two-mass model affects the oscillation and synthesis quality.
The supplemental material here contains the audio stimuli used for the perception experiments.
Stone S, Marxen M, Birkholz P (2018). Construction and Evaluation of a Parametric One-Dimensional Vocal Tract Model. IEEE/ACM Transactions on Audio, Speech and Language Processing, 26(8), pp. 1381-1392. doi: 10.1109/TASLP.2018.2825601
The supplemental material here contains the audio stimuli used for the perception experiments and an additional table with parameter values.
Klause F, Stone S, Birkholz P (2017). A head-mounted camera system for the measurement of lip protrusion and opening during speech production. In: Trouvain J, Steiner I, Möbius B (eds.) Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2017 (TUDPress, Dresden), pp. 145-151 [pdf]
This paper presents a head-mounted camera system for the simultaneous measurement of lip protrusion and opening during speech production. All files needed to construct the camera helmet, as well as the software and scripts for extracting the parameters, can be downloaded here.
Birkholz P, Martin L, Xu Y, Scherbaum S, Neuschaefer-Rube C (2016). Manipulation of the prosodic features of vocal tract length, nasality and articulatory precision using articulatory synthesis. Computer Speech & Language, 41, pp. 116-127
In this study we examined the articulatory generation of secondary prosodic features using articulatory speech synthesis. To this end, we manipulated certain articulatory features of re-synthesized German words and asked listeners to rate the prosodic effects. We applied the following articulatory manipulations:
- Best possible resynthesis of the natural utterance (no manipulation)
- Simulation of a longer vocal tract (larynx lowering and lip protrusion)
- Simulation of a shorter vocal tract (larynx raising and lip retraction)
- Nasalized articulation of all sonorants
- Reduced articulatory effort (lower speed of target approximation)
- Increased articulatory effort (higher speed of target approximation)
- Slight centralization of all vowels and consonants (towards an indifferent articulation)
- Strong centralization of all vowels and consonants
Birkholz P, Martin L, Willmes K, Kröger BJ, Neuschaefer-Rube C (2015). The contribution of phonation type to the perception of vocal emotions in German: an articulatory synthesis study. Journal of the Acoustical Society of America, 137(3), pp. 1503–1512
In this study we examined how a change of phonation type alone can alter the perception of vocal emotions in re-synthesized portrayed emotional utterances, when other prosodic parameters (phone duration, pitch) remain the same as in the original utterance. Below are the stimuli for one of the examined sentences ("Der Lappen liegt auf dem Eisschrank"), which was originally spoken with seven emotional expressions. For each emotion, you will first hear the original utterance re-synthesized as well as possible with regard to phonation type, phone duration, and pitch. Then you will hear the same sentence with the phonation type replaced by purely breathy voice, purely modal voice, and purely pressed voice.
Mumtaz R, Preuß S, Neuschaefer-Rube C, Hey C, Sader R, Birkholz P (2014). Tongue Contour Reconstruction from Optical and Electrical Palatography. IEEE Signal Processing Letters, 21(6), pp. 658-662
In this study we examined the potential of optical and electrical palatography (OPG and EPG) for the reconstruction of the tongue contour. To this end, we extracted the tongue contour and the corresponding (virtual) EPG and OPG data from MRI corpora of the vocal tract of two different speakers, and trained linear models to predict the tongue contours based on the sensor data.
This supplemental material contains the tongue and vocal tract contours extracted from the MRI samples as SVG files, the tongue points and simulated sensor data in Excel tables, and MATLAB scripts for the cross-validation and EPG index calculation.
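As a rough illustration of the general idea (a linear model that maps sensor readings to contour points, evaluated by cross-validation), the following Python sketch may be helpful. It is not the MATLAB code from the supplemental material: the data are synthetic, and all names (`fit_linear`, `predict`, `leave_one_out_rmse`) as well as the array sizes and the choice of leave-one-out cross-validation are assumptions made for this example.

```python
import numpy as np

def fit_linear(X, Y):
    """Least-squares fit of Y ~ [X, 1] @ W (linear model with intercept)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return W

def predict(W, X):
    """Apply the fitted linear model to new sensor data."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return A @ W

def leave_one_out_rmse(X, Y):
    """Hold out each sample once, fit on the rest, and pool the errors."""
    n = X.shape[0]
    sq_errors = []
    for i in range(n):
        train = np.arange(n) != i            # boolean mask excluding sample i
        W = fit_linear(X[train], Y[train])
        err = predict(W, X[i:i + 1]) - Y[i:i + 1]
        sq_errors.append(np.mean(err ** 2))
    return float(np.sqrt(np.mean(sq_errors)))

# Synthetic stand-in data (hypothetical sizes, not the MRI-derived data):
# rows are vocal tract samples, X holds sensor values, Y holds contour coordinates.
rng = np.random.default_rng(0)
n_samples, n_sensors, n_coords = 40, 5, 6
X = rng.normal(size=(n_samples, n_sensors))
true_W = rng.normal(size=(n_sensors, n_coords))
Y = X @ true_W + 0.01 * rng.normal(size=(n_samples, n_coords))

rmse = leave_one_out_rmse(X, Y)
print(f"Leave-one-out RMSE: {rmse:.4f}")
```

With real data, `X` would hold the (virtual) EPG and OPG sensor values and `Y` the coordinates of the tongue contour points for each MRI sample.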
Birkholz P, Kröger BJ, Neuschaefer-Rube C (2011). Articulatory synthesis of words in six voice qualities using a modified two-mass model of the vocal folds. In: Proc. of the 1st International Workshop on Performative Speech and Singing Synthesis, Vancouver, BC [pdf]
In this study we examined the capability of our extended two-mass model of the vocal folds to synthesize words in different voice qualities. The following stimuli were synthesized and judged by listeners with respect to the perceived voice quality:
- Intended normal voice quality
- Intended pressed voice quality
- Intended breathy voice quality
- Intended whispery voice quality (pressed breathy/murmur)
- Intended vocal fry
- Intended falsetto
Birkholz P, Kröger BJ, Neuschaefer-Rube C (2010). Stimmsynthese mit einem Zwei-Massen-Modell der Stimmlippen mit dreieckigem Öffnungsquerschnitt. In 27th Jahrestagung der DGPP, Aachen, Germany [pdf]
- Synthetic vowel stimuli for several steps on the continuum from the minimal to the maximal glottal rest area using the classical two-mass model of the vocal folds (wav). All stimuli were perceived as having a slightly pressed to normal voice quality.
- Synthetic vowel stimuli for several steps on the continuum from the minimal to the maximal glottal rest area using the new two-mass model of the vocal folds with a triangular glottis shape in the rest position (wav). As in reality, the voice quality changes from pressed through normal to breathy as the glottal rest area increases.