Supplemental Materials
The material below is intended to supplement selected publications by Peter Birkholz and coworkers. It contains, for example, audio stimuli that were used in perception experiments described in the papers.
Blandin R, Geng J, Birkholz P (submitted). Experimental investigation of speech directivity mechanisms.
This study explored the combined influence of the torso, the lips and the vocal tract geometry on speech directivity. The supplemental material contains the measured and simulated directivity data of this study.
Blandin R, Stone S, Remacle A, Didone V, Birkholz P (submitted). A comparative study of 3D and 1D acoustic simulations of speech.
This study explored the perceptual impact of the high-frequency content (> 4 kHz) of vowels synthesized with three different degrees of physical realism: a 1D acoustic simulation based on a transmission-line model of the vocal tract, a 3D acoustic simulation based on the multimodal method, and an artificial bandwidth-extension algorithm. The supplemental material includes the synthetic stimuli and the code used to generate the stimuli.
Stone S, Birkholz P (submitted). Articulatory synthesis of German primary diphthongs using monophthong vocal tract shapes.
This study explored the articulatory synthesis of German primary diphthongs from different combinations of monophthong vocal tract shapes and their perceived naturalness. The supplemental material includes the data from the formant analysis of natural diphthongs, the synthesized audio files, etc.
Birkholz P, Ossmann S, Blandin R, Wilbrandt A, Krug P, Fleischer M (2022). Modeling speech sound radiation with different degrees of realism for articulatory synthesis.
This study proposed two models for the radiation characteristic that are more realistic than the commonly used first-order high-pass filter, and explored their acoustic and perceptual impact. The supplemental material consists of the synthesized audio files and the simulated radiation characteristics, the latter also as digital FIR filters for several common sampling rates.
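For orientation, the baseline mentioned above can be sketched in a few lines. This is a minimal illustration only, assuming the first-order high-pass filter is the simple first difference y[n] = x[n] - a*x[n-1]; the function names and the coefficient `a` are hypothetical and are not taken from the paper or from the supplemental FIR filters.

```python
import numpy as np

def first_difference(x, a=1.0):
    """Apply y[n] = x[n] - a * x[n-1] (assumed first-order high-pass)."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= a * x[:-1]
    return y

def magnitude_response(freqs_hz, fs_hz, a=1.0):
    """Magnitude of H(e^jw) = 1 - a * e^(-jw) at the given frequencies."""
    w = 2.0 * np.pi * np.asarray(freqs_hz, dtype=float) / fs_hz
    return np.abs(1.0 - a * np.exp(-1j * w))

# Example: the response rises by about 6 dB per octave at low frequencies.
fs = 44100.0
mags = magnitude_response([100.0, 200.0], fs)
octave_gain_db = 20.0 * np.log10(mags[1] / mags[0])
```

With a = 1 the gain is zero at 0 Hz and reaches 2 at the Nyquist frequency, which is the crude approximation that the more realistic radiation models in the study are compared against.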
Langheinrich I, Stone S, Zhang X, Birkholz P (2022). Glottal inverse filtering based on articulatory synthesis and deep learning.
This study proposed a method for glottal inverse filtering based on a neural network that was trained with speech and glottal flow signals generated by the articulatory speech synthesizer VocalTractLab. The supplemental material consists of the synthesized audio and glottal flow signals, as well as the text material, the baseline segment sequence files, the baseline and manipulated gestural score files, and the training and test split.
Birkholz P, Mayer CK, Häsner P (2022). Towards a soft fluidic elastomer tongue for a mechanical vocal tract.
This study presents the prototype of a soft fluidic elastomer tongue with three water-filled chambers. The tongue can be deformed by filling the chambers with varying volumes of water. The supplemental material contains the CAD parts of the vocal tract walls and the mold for the tongue, as well as the synthetic stimuli generated with the physical model.
Stone S, Abdul-Hak P, Birkholz P (2022). Perceptual cues for smiled voice - An articulatory synthesis study.
This study investigated the perceptual cues for smiled voice in continuous speech with an analysis-by-synthesis approach using the VocalTractLab articulatory synthesizer. The supplemental material contains the synthetic stimuli used in the perception experiment and the files needed to synthesize them.
Digehsara PA, Possamai de Menezes JV, Wagner C, Bärhold M, Schaffer P, Plettemeier D, Birkholz P (2022). A user-friendly headset for radar-based silent speech recognition.
This study introduces and evaluates a 3D-printable headset that was designed for a consistent (cross-session) placement of antennas on a user's face for a radar-based silent speech interface. The supplemental material contains the STL files of the printable parts for the headset.
Wagner C, Schaffer P, Digehsara PA, Bärhold M, Plettemeier D, Birkholz P (2022). Silent speech command word recognition using stepped frequency continuous wave radar.
This study describes the use of radar transmission spectra in the 1-6 GHz range for silent speech recognition and evaluates its feasibility on a small-vocabulary German two-syllable corpus. The supplemental material contains the complete command word corpus, i.e., every utterance saved with its respective radar spectrogram (binary file), a text label file, and an audio file (.wav). It also provides a Matlab script to display the individual samples side by side.
Birkholz P, Häsner P, Kürbis S (2022). Acoustic comparison of physical vocal tract models with hard and soft walls.
This study explored how the frequencies and bandwidths of the acoustic resonances of physical tube models of the vocal tract differ when they have hard vs. soft walls. The supplemental material here contains the area functions and the 3D-printable STL files of the tube models and the casting molds, and the measured transfer functions.
Birkholz P, Drechsel S (2021). Effects of the piriform fossae, transvelar acoustic coupling, and laryngeal wall vibration on the naturalness of articulatory speech synthesis. Speech Communication, 132, pp. 96-105. doi: 10.1016/j.specom.2021.06.002
This study explored how the piriform fossae, transvelar acoustic coupling, and laryngeal wall vibration affect the naturalness of articulatory speech synthesis with VocalTractLab. The supplemental material here contains the synthesized stimuli used in the experiments.
Blandin R, Arnela M, Felix S, Doc JB, Birkholz P (2021). Comparison of the finite element method, the multimodal method and the transmission-line model for the computation of vocal tract transfer functions. In Proc. of the Interspeech 2021, pp. 3330-3334, Brno, Czech Republic
In this study, three methods for the computation of the vocal tract transfer function were compared by means of eight 3D vocal tract shapes (four vowels, each generated with a male and a female vocal tract model). The supplemental material here contains the 3D vocal tract shapes and a list of the obtained resonance frequencies and bandwidths.
Wagner C, Stappenbeck L, Wenzel H, Steiner P, Lehnert B, Birkholz P (2021). Evaluation of a non-personalized optopalatographic device for prospective use in functional post-stroke dysphagia therapy. IEEE Transactions on Biomedical Engineering
In this study, a retainer-free optopalatographic device is presented that is capable of measuring tongue movement trajectories in situ to provide real-time feedback of the articulation. The supplemental material here contains the sensor data acquired with the proposed device from a group of subjects during the performance of different tongue exercises that are commonly used in dysphagia therapy. The source code for the analysis and classification of the data is also included.
Häsner P, Prescher A, Birkholz P (2021). Effect of wavy trachea walls on the oscillation onset pressure of silicone vocal folds. The Journal of the Acoustical Society of America, 149(1), pp. 466-475
In this study, the influence of non-smooth trachea walls on phonation onset and offset pressures and on the fundamental frequency of oscillation was experimentally investigated for three different synthetic vocal fold models. The supplemental material here contains a video that shows the formation of surface waves on the trachea walls when the trachea is stretched or compressed, as well as the CAD and STL geometry files that were created for 3D-printing parts of the experimental setup and the vocal fold models.
Xue Y, Marxen M, Akagi M, Birkholz P (2021). Acoustic and articulatory analysis and synthesis of shouted vowels. Computer Speech & Language, 66, 101156, doi: 10.1016/j.csl.2020.101156
In this study we compared spoken and shouted vowels, both at the acoustic and articulatory levels. The supplemental material here contains the acoustic features, the cross-distance functions of the vocal tract, and the vocal tract contours of the spoken and shouted vowels analyzed in the study.
Stone S, Gao Y, Birkholz P (submitted). Articulatory synthesis of vocalized /r/ allophones in German.
In this study we examined vocalic /r/ allophones in German in the context of articulatory speech synthesis. The supplemental material here contains the audio recordings of the analyzed words with /r/ allophones as well as the synthesized words for the perception experiment.
Stone S, Birkholz P (2021). Articulation-to-speech using electro-optical stomatography and articulatory synthesis. ISSP 2021
In this study we controlled the articulatory speech synthesizer VocalTractLab with articulatory movements captured by electro-optical stomatography, i.e., with contact and distance sensors on an artificial palate that is worn in the mouth. This makes it possible to generate audible speech from silently produced speech movements. The supplemental material here provides the audio files for a range of speech sounds generated with the system.
Stone S, Azgin A, Mänz S, Birkholz P (2021). Prospects of articulatory text-to-speech synthesis. ISSP 2021
In this study we outline our architecture for an articulatory text-to-speech synthesis system based on VocalTractLab. With the audio files in this supplemental material we demonstrate a variety of prosodic modifications that can be applied to the generated utterances using simple rules. Please download the examples here.
Stone S, Schmidt P, Birkholz P (2020). Prediction of voicing and the f0 contour from electromagnetic articulography data for articulation-to-speech synthesis. ICASSP 2020
In this study we examined to what extent EMA data allow the prediction of voicing and f0 of the speech signal. Predicted f0 contours were imposed on the natural speech recordings made during the EMA measurements and used in a perception experiment. The stimuli of the experiment can be found here.
Birkholz P, Drechsel S, Stone S (2019). Perceptual optimization of an enhanced geometric vocal fold model for articulatory speech synthesis. Interspeech 2019
In this study we examined which control parameter settings of an enhanced geometric vocal fold model would generate the most natural-sounding male and female voices. The male samples (five German words) with the best parameter settings can be played here. The female samples with the best parameter settings can be played here.
Birkholz P, Gabriel F, Kürbis S, Echternach M (2019). How the peak glottal area affects linear predictive coding-based formant estimates of vowels. The Journal of the Acoustical Society of America, 146(1), pp. 223-232.
The supplemental material here contains the CAD files for 3D-printing the ten vocal tract resonators, the CAD files needed to create the silicone vocal fold model, and the audio files of the synthetic vowels generated with the physical models and the computer simulations.
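As background for the formant estimates named in the title, the following is a generic sketch of LPC-based formant estimation with the autocorrelation method. It is not the analysis procedure of the paper: the function names, the model order, and the synthetic test signal (an all-pole impulse response with assumed resonances at 500 Hz and 1500 Hz) are illustrative assumptions.

```python
import numpy as np

def lpc_autocorr(x, order):
    """LPC polynomial A(z) = 1 + a1 z^-1 + ... via the autocorrelation method."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Autocorrelation at lags 0..order.
    r = np.correlate(x, x, mode="full")[n - 1 : n + order]
    # Solve the Toeplitz normal equations R p = r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    p = np.linalg.solve(R, r[1 : order + 1])
    return np.concatenate([[1.0], -p])

def formants_from_lpc(a, fs_hz):
    """Formant frequency estimates from the angles of the LPC poles."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]  # keep one pole of each conjugate pair
    return np.sort(np.angle(roots) * fs_hz / (2.0 * np.pi))

# Synthetic test signal (assumption): impulse response of a two-resonance
# all-pole filter, so that the true pole frequencies are known.
fs = 10000.0
a_true = np.array([1.0])
for f in (500.0, 1500.0):
    theta = 2.0 * np.pi * f / fs
    a_true = np.convolve(a_true, [1.0, -2.0 * 0.97 * np.cos(theta), 0.97 ** 2])
h = np.zeros(2000)
for i in range(len(h)):
    acc = 1.0 if i == 0 else 0.0
    for k in range(1, len(a_true)):
        if i - k >= 0:
            acc -= a_true[k] * h[i - k]
    h[i] = acc

estimated = formants_from_lpc(lpc_autocorr(h, 4), fs)
```

On this idealized impulse response the estimates match the true pole frequencies closely. With a periodic glottal excitation, as in real or synthetic vowels, the estimates can deviate, which is the kind of effect the study examines.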
Birkholz P, Stone S, Kürbis S (2019). Comparison of different methods for the voiced excitation of physical vocal tract models. In: Birkholz P, Stone S (eds.) Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2019 (TUDPress, Dresden)
The supplemental material here contains the CAD files for 3D-printing the eight vocal tract resonators for the eight long German vowels and the audio files of the stimuli generated using the different excitation methods.
Birkholz P, Pape D (submitted). How modeling entrance loss and flow separation in a two-mass model affects the oscillation and synthesis quality.
The supplemental material here contains the audio stimuli used for the perception experiments.
Stone S, Marxen M, Birkholz P (2018). Construction and Evaluation of a Parametric One-Dimensional Vocal Tract Model. IEEE/ACM Transactions on Audio, Speech and Language Processing, 26(8), pp. 1381-1392. doi: 10.1109/TASLP.2018.2825601
The supplemental material here contains the audio stimuli used for the perception experiments and an additional table with parameter values.
Klause F, Stone S, Birkholz P (2017). A head-mounted camera system for the measurement of lip protrusion and opening during speech production. In: Trouvain J, Steiner I, Möbius B (eds.) Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2017 (TUDPress, Dresden), pp. 145-151 [pdf]
This paper presents a head-mounted camera system for the simultaneous measurement of lip protrusion and opening during speech production. All necessary files for the construction of the camera helmet, and the software and scripts for the extraction of the parameters can be downloaded here.
Birkholz P, Martin L, Xu Y, Scherbaum S, Neuschaefer-Rube C (2016). Manipulation of the prosodic features of vocal tract length, nasality and articulatory precision using articulatory synthesis. Computer Speech & Language, 41, pp. 116-127
In this study we examined the articulatory generation of secondary prosodic features using articulatory speech synthesis. To this end, we manipulated certain articulatory features of re-synthesized German words and asked listeners to rate the prosodic effects. We applied the following articulatory manipulations:
- Best possible resynthesis of the natural utterance (no manipulation)
- Simulation of a longer vocal tract (larynx lowering and lip protrusion)
- Simulation of a shorter vocal tract (larynx raising and lip retraction)
- Nasalized articulation of all sonorants
- Reduced articulatory effort (lower speed of target approximation)
- Increased articulatory effort (higher speed of target approximation)
- Slight centralization of all vowels and consonants (towards an indifferent articulation)
- Strong centralization of all vowels and consonants
Birkholz P, Martin L, Willmes K, Kröger BJ, Neuschaefer-Rube C (2015). The contribution of phonation type to the perception of vocal emotions in German: an articulatory synthesis study. Journal of the Acoustical Society of America, 137(3), pp. 1503–1512
In this study we examined how changing only the phonation type can change the perception of vocal emotions in re-synthesized portrayed emotional utterances when other prosodic parameters (phone duration, pitch) remain the same as in the original utterance. Below are the stimuli for one of the examined sentences ("Der Lappen liegt auf dem Eisschrank"), which was originally spoken with seven emotional expressions. For each emotion, you will first hear the original utterance re-synthesized as well as possible with regard to phonation type, phone duration, and pitch. Then you will hear the same sentence with the phonation type replaced by purely breathy voice, purely modal voice, and purely pressed voice.
Mumtaz R, Preuß S, Neuschaefer-Rube C, Hey C, Sader R, Birkholz P (2014). Tongue Contour Reconstruction from Optical and Electrical Palatography. IEEE Signal Processing Letters, 21(6), pp. 658-662
In this study we examined the potential of optical and electrical palatography (OPG and EPG) for the reconstruction of the tongue contour. To this end, we extracted the tongue contour and the corresponding (virtual) EPG and OPG data from MRI corpora of the vocal tract of two different speakers, and trained linear models to predict the tongue contours based on the sensor data.
This supplemental material contains the tongue and vocal tract contours extracted from the MRI samples as svg files, the tongue points and simulated sensor data in Excel tables, and Matlab scripts for the cross-validation and EPG index calculation.
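The general idea of predicting contour points from sensor data with a linear model and checking it by cross-validation can be sketched as follows. This is not the Matlab code from the supplemental material: the data are random placeholders, the dimensions (5 sensor values, 8 contour coordinates, 40 samples) are hypothetical, and leave-one-out is only one possible cross-validation scheme.

```python
import numpy as np

def fit_linear(X, Y):
    """Least-squares fit of Y ~ [X, 1] @ W (linear model with bias term)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def predict(W, X):
    """Predict contour coordinates for the sensor data X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W

def loo_rmse(X, Y):
    """Leave-one-out cross-validation: RMSE over all held-out samples."""
    n = X.shape[0]
    errors = []
    for i in range(n):
        train = np.arange(n) != i          # hold out sample i
        W = fit_linear(X[train], Y[train])
        errors.append(predict(W, X[i : i + 1])[0] - Y[i])
    return float(np.sqrt(np.mean(np.square(errors))))

# Placeholder data (assumption): a noise-free linear relation, so the
# held-out error should be close to zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))               # simulated sensor values
W_true = rng.normal(size=(6, 8))           # 5 weights + bias per output
Y = np.hstack([X, np.ones((40, 1))]) @ W_true
rmse = loo_rmse(X, Y)
```

With real sensor data the held-out error indicates how well the contour can be reconstructed for samples the model has not seen.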
Birkholz P, Kröger BJ, Neuschaefer-Rube C (2011). Articulatory synthesis of words in six voice qualities using a modified two-mass model of the vocal folds. In: Proc. of the 1st International Workshop on Performative Speech and Singing Synthesis, Vancouver, BC [pdf]
In this study we examined the capability of our extended two-mass model of the vocal folds to synthesize words in different voice qualities. The following stimuli were synthesized and judged by listeners with respect to the perceived voice quality:
- Intended normal voice quality
- Intended pressed voice quality
- Intended breathy voice quality
- Intended whispery voice quality (pressed breathy/murmur)
- Intended vocal fry
- Intended falsetto
Birkholz P, Kröger BJ, Neuschaefer-Rube C (2010). Stimmsynthese mit einem Zwei-Massen-Modell der Stimmlippen mit dreieckigem Öffnungsquerschnitt. In 27th Jahrestagung der DGPP, Aachen, Germany [pdf]
- Synthetic vowel stimuli for several steps on the continuum from the minimal to the maximal glottal rest area using the classical two-mass model of the vocal folds (wav). All stimuli were perceived as having a slightly pressed to normal voice quality.
- Synthetic vowel stimuli for several steps on the continuum from the minimal to the maximal glottal rest area using the new two-mass model of the vocal folds with a triangular glottis shape in the rest position (wav). As in reality, the voice quality changes from pressed via normal to breathy as the glottal rest area increases.