Supplemental Materials

The material below supplements selected publications by Peter Birkholz and coworkers. It contains, for example, audio stimuli that were used in perception experiments described in the papers.

Huang Z, Zhang T, Birkholz P (submitted). Self-supervised multi-task learning for enhanced prosody prediction in articulatory speech synthesis.

This study presents a multi-task prosody prediction model for German articulatory speech synthesis, combining duration, f0 and voicing prediction within an LSTM-based framework. The goal was to investigate how different self-supervised pre-training strategies affect downstream prosody modelling. The supplemental material contains the 20 sentences from the Kiel Corpus synthesized with each pre-training strategy, which were used in the perceptual evaluation experiment reported in the paper.

Zhang T, Birkholz P (submitted). Joint Estimation of Source and Filter Parameters for Speaker Adaptation in Articulatory Speech Synthesis.

This study used MRI data of the vocal tract of two participants to create two new speaker models for the articulatory synthesizer VocalTractLab. The focus of the study was the joint optimization of the vocal tract configuration and the glottal rest configuration to reproduce the naturally spoken vowels as closely as possible. The audio files of the natural and optimized vowels can be found here.

Kleiner C, Birkholz P, Schäfer D, Arai T (2025). Acoustical characterization and perceptual comparison of four types of 3D-printed vocal tract models for the German and Japanese vowels /a,e,i,o,u/.

This study made an acoustical and perceptual comparison of four types of 3D-printed vocal tract models for the German and Japanese vowels /a,e,i,o,u/ with different degrees of geometrical simplification and acoustical tuning. The supplemental material here contains the 20 stimuli for the listening experiment. They were created by voiced excitation of 20 physical vocal tract models, one for each combination of model type (GER-MRI, JAP-BENT, GER-STRAIGHT, JAP-STRAIGHT) and vowel (/a,e,i,o,u/), using a reed source.

Steiner P, Huang Z, Birkholz P (2025). Neural prosody prediction for German articulatory speech synthesis.

This study introduces a prosody prediction model for German articulatory speech synthesis using a recurrent neural network. The focus of this study was to explore the effect of four different phoneme embedding techniques on the model performance. The supplemental material includes the 20 sentences that were synthesized with each of the four embedding methods and used in the perception experiment in the study.

Fietkau AL, Menezes J, Birkholz P (2025). Evaluating optopalatography sensor positions for silent command word recognition.

This study used a custom-developed OPG device as a silent-speech interface for the task of command word recognition and investigated the relevance of different sensing positions. The supplemental material here contains the hyperparameter sets that yielded the highest recognition accuracies for each evaluation type and speaker combination. The search space for each hyperparameter is described in the paper.

Birkholz P, Häsner P (2024). Measurement and simulation of pressure losses due to airflow in vocal tract models.

This study proposes a unified model for viscous and kinetic energy losses in a discrete tube model of the vocal system including the glottis. The parameters of the model were adjusted to reproduce the results of measurements with physical replicas of the glottis and the vocal tract. The supplemental material includes the 3D-printable geometries of the vocal tract and glottis replicas as well as the continuous and discrete area functions of the vocal tract models.

Possamai de Menezes JV, Kleiner C, Kainz MA, Echternach M, Birkholz P (2024). Synchrony of glottal area waveform parameters during the production of obstruents in vowel context.

This study investigated the glottal area waveforms during the production of voiced and voiceless obstruents in vowel context to analyze the degree of synchrony of the waveform parameters. The supplemental material includes, for all recorded VCVs, the glottal area waveform (derived from high-speed laryngoscopy films) and the time functions of the following derived parameters: open quotient, fundamental frequency, AC amplitude, and DC amplitude.

Stone S, Birkholz P (2024). Monophthong vocal tract shapes are sufficient for articulatory synthesis of German primary diphthongs.

This study explored the articulatory synthesis of German primary diphthongs from different combinations of monophthong vocal tract shapes and their perceived naturalness. The supplemental material includes the data from the formant analysis of natural diphthongs, the synthesized audio files, etc.

Blandin R, Stone S, Remacle A, Didone V, Birkholz P (2023). A comparative study of 3D and 1D acoustic simulations of the higher frequencies of speech.

This study explored the perceptual impact of the high frequency content (> 4 kHz) of vowels synthesized with different degrees of physical realism, using a 1D acoustic simulation based on a transmission-line model of the vocal tract vs. using a 3D acoustic simulation based on the multimodal method vs. using an artificial bandwidth-extension algorithm. The supplemental material includes the synthetic stimuli and the code used to generate the stimuli.

Blandin R, Geng J, Birkholz P (2023). Investigation of the influence of the torso, lips and vocal tract configuration on speech directivity using measurements from a custom head and torso simulator.

This study explored the combined influence of the torso, the lips and the vocal tract geometry on speech directivity. The supplemental material contains the measured and simulated directivity data of this study.

Birkholz P, Blandin R, Kürbis S (2023). Bandwidths of vocal tract resonances in physical models compared to transmission-line simulations.

This study investigated how the bandwidths of resonances simulated by transmission-line models of the vocal tract compare to bandwidths measured from physical 3D-printed vowel resonators. Please find the supplemental material here. For the three types of physical resonators, it contains the 3D-printable resonator models (in terms of STL files) and the measured volume velocity transfer functions (in terms of text files). There are also the Matlab scripts to extract the resonance frequencies and bandwidths from the transfer functions and save them as text files. Also included are the area functions of the simulated vowels and the Matlab file that performs the frequency-domain simulation (either for the coarse loss model or the detailed loss model), extracts the resonance frequencies and bandwidths, and saves them to text files.
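As a rough illustration of what extracting resonance frequencies and bandwidths from a transfer function involves, the following Python sketch locates the local maxima of a sampled magnitude response and measures each peak's width at the -3 dB points. It is not the Matlab code from the supplemental material: the function name `resonance_peaks`, the linear interpolation, and the synthetic single-resonance test signal are all assumptions made for this example, and the simple search is only meaningful for clearly separated peaks with positive magnitudes.

```python
import math

def resonance_peaks(freqs_hz, mags):
    """Return (frequency, bandwidth) pairs for local maxima of |H(f)|.

    The bandwidth is the width of the peak at 1/sqrt(2) of its height
    (the -3 dB points), found by linear interpolation between samples.
    """
    def crossing(i, step, level):
        # Walk away from the peak while the next sample is still above `level`.
        while 0 <= i + step < len(mags) and mags[i + step] > level:
            i += step
        j = i + step
        if not 0 <= j < len(mags):
            return None  # the flank leaves the sampled frequency range
        # Interpolate between sample i (above level) and sample j (at or below).
        t = (mags[i] - level) / (mags[i] - mags[j])
        return freqs_hz[i] + t * (freqs_hz[j] - freqs_hz[i])

    results = []
    for k in range(1, len(mags) - 1):
        if mags[k - 1] < mags[k] >= mags[k + 1]:
            level = mags[k] / math.sqrt(2.0)
            lower = crossing(k, -1, level)
            upper = crossing(k, +1, level)
            if lower is not None and upper is not None:
                results.append((freqs_hz[k], upper - lower))
    return results

# Hypothetical test signal: one second-order resonance at 500 Hz with Q = 10,
# sampled in 1 Hz steps, so the expected bandwidth is roughly 500 / 10 = 50 Hz.
f0, q = 500.0, 10.0
freqs = [float(f) for f in range(0, 1001)]
mags = [1.0 / math.sqrt((1.0 - (f / f0) ** 2) ** 2 + (f / (q * f0)) ** 2)
        for f in freqs]
peaks = resonance_peaks(freqs, mags)
```

Measured transfer functions are usually noisier than this synthetic curve, so a practical script would typically smooth the data or fit a resonance model around each peak before reading off the bandwidth.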

Birkholz P, Ossmann S, Blandin R, Wilbrandt A, Krug P, Fleischer M (2022). Modeling speech sound radiation with different degrees of realism for articulatory synthesis.

This study proposed two models for the radiation characteristic that are more realistic than the commonly used first-order high-pass filter, and explored their acoustic and perceptual impact. The supplemental material consists of the synthesized audio files and the simulated radiation characteristics, the latter also in the form of digital FIR filters for different common sampling rates.
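For readers unfamiliar with the baseline mentioned above, a first-order high-pass filter can be written in discrete time as y[n] = x[n] - a * x[n-1], which boosts the signal by roughly 6 dB per octave at low frequencies. The Python sketch below shows this generic form only. It is not one of the models proposed in the paper, and the function names and the coefficient value a = 0.99 are assumptions chosen for illustration.

```python
import math

def first_order_highpass(signal, a=0.99):
    """Apply y[n] = x[n] - a * x[n-1], taking x[-1] as 0."""
    out = []
    prev = 0.0
    for x in signal:
        out.append(x - a * prev)
        prev = x
    return out

def gain_db(freq_hz, sample_rate_hz, a=0.99):
    """Magnitude response of the filter above, in dB, at one frequency."""
    w = 2.0 * math.pi * freq_hz / sample_rate_hz
    mag = math.sqrt(1.0 + a * a - 2.0 * a * math.cos(w))
    return 20.0 * math.log10(mag)

# Gain difference between 2 kHz and 1 kHz at a 44.1 kHz sampling rate;
# for this filter it comes out close to 6 dB (one octave).
octave_step_db = gain_db(2000.0, 44100.0) - gain_db(1000.0, 44100.0)
```

With a = 1 the filter reduces to a plain first difference, which removes any constant (DC) component of the input after the first sample.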

Langheinrich I, Stone S, Zhang X, Birkholz P (2022). Glottal inverse filtering based on articulatory synthesis and deep learning.

This study proposed a method for glottal inverse filtering based on a neural network that was trained with speech and glottal flow signals generated by the articulatory speech synthesizer VocalTractLab. The supplemental material consists of the synthesized audio and glottal flow signals, as well as the text material, the baseline segment sequence files, the baseline and manipulated gestural score files, and the training and test split.

Birkholz P, Mayer CK, Häsner P (2022). Towards a soft fluidic elastomer tongue for a mechanical vocal tract.

This study presents the prototype of a soft fluidic elastomer tongue with three water-filled chambers. The tongue can be deformed by filling the chambers with varying volumes of water. The supplemental material contains the CAD parts of the vocal tract walls and the mold for the tongue, as well as the synthetic stimuli generated with the physical model.

Stone S, Abdul-Hak P, Birkholz P (2022). Perceptual cues for smiled voice - An articulatory synthesis study.

This study investigated the perceptual cues in continuous speech for smiled voice with an analysis-by-synthesis approach using the VocalTractLab articulatory synthesizer. The supplemental material contains the synthetic stimuli used in the perception experiment and the files needed to synthesize them.

Digehsara PA, Possamai de Menezes JV, Wagner C, Bärhold M, Schaffer P, Plettemeier D, Birkholz P (2022). A user-friendly headset for radar-based silent speech recognition.

This study introduces and evaluates a 3D-printable headset that was designed for a consistent (cross-session) placement of antennas on a user's face for a radar-based silent speech interface. The supplemental material contains the STL files of the printable parts for the headset.

Wagner C, Schaffer P, Digehsara PA, Bärhold M, Plettemeier D, Birkholz P (2022). Silent speech command word recognition using stepped frequency continuous wave radar.

This study describes and evaluates the feasibility of using radar transmission spectra in the 1-6 GHz range for silent speech recognition, evaluated on a small-vocabulary German two-syllable corpus. The supplemental material contains the complete command word corpus, i.e., each utterance saved with its radar spectrogram (binary file), a text label file, and an audio file (.wav). It also provides a Matlab script to display the individual samples side by side.

Birkholz P, Häsner P, Kürbis S (2022). Acoustic comparison of physical vocal tract models with hard and soft walls.

This study explored how the frequencies and bandwidths of the acoustic resonances of physical tube models of the vocal tract differ when they have hard vs. soft walls. The supplemental material here contains the area functions and the 3D-printable STL files of the tube models and the casting molds, and the measured transfer functions.

Birkholz P, Drechsel S (2021). Effects of the piriform fossae, transvelar acoustic coupling, and laryngeal wall vibration on the naturalness of articulatory speech synthesis. Speech Communication, 132, pp. 96-105. doi: 10.1016/j.specom.2021.06.002

This study explored how the piriform fossae, transvelar acoustic coupling, and laryngeal wall vibration affect the naturalness of articulatory speech synthesis with VocalTractLab. The supplemental material here contains the synthesized stimuli used in the experiments.

Blandin R, Arnela M, Felix S, Doc JB, Birkholz P (2021). Comparison of the finite element method, the multimodal method and the transmission-line model for the computation of vocal tract transfer functions. In Proc. of the Interspeech 2021, pp. 3330-3334, Brno, Czech Republic

In this study, three methods for the computation of the vocal tract transfer function were compared by means of eight 3D vocal tract shapes (four vowels, each generated with a male and a female vocal tract model). The supplemental material here contains the 3D vocal tract shapes and a list of the obtained resonance frequencies and bandwidths.

Wagner C, Stappenbeck L, Wenzel H, Steiner P, Lehnert B, Birkholz P (2021). Evaluation of a non-personalized optopalatographic device for prospective use in functional post-stroke dysphagia therapy. IEEE Transactions on Biomedical Engineering

In this study, a retainer-free optopalatographic device is presented that is capable of measuring tongue movement trajectories in situ to provide real-time feedback of the articulation. The supplemental material here contains the sensor data acquired with the proposed device from a group of subjects during the performance of different tongue exercises that are commonly used in dysphagia therapy. The source code for the analysis and classification of the data is also included.

Häsner P, Prescher A, Birkholz P (2021). Effect of wavy trachea walls on the oscillation onset pressure of silicone vocal folds. The Journal of the Acoustical Society of America, 149(1), pp. 466-475

In this study, the influence of non-smooth trachea walls on phonation onset and offset pressures and on the fundamental frequency of oscillation was experimentally investigated for three different synthetic vocal fold models. The supplemental material here contains a video that shows the formation of surface waves on the trachea walls when the trachea is stretched or compressed, as well as the CAD and STL geometry files that were created for 3D-printing parts of the experimental setup and the vocal fold models.

Xue Y, Marxen M, Akagi M, Birkholz P (2021). Acoustic and articulatory analysis and synthesis of shouted vowels. Computer Speech & Language, 66, 101156, doi: 10.1016/j.csl.2020.101156

In this study we compared spoken and shouted vowels, both at the acoustic and articulatory levels. The supplemental material here contains the acoustic features, the cross-distance functions of the vocal tract, and the vocal tract contours of the spoken and shouted vowels analyzed in the study.

Stone S, Gao Y, Birkholz P (submitted). Articulatory synthesis of vocalized /r/ allophones in German.

In this study we examined vocalic /r/ allophones in German in the context of articulatory speech synthesis. The supplemental material here contains the audio recordings of the analyzed words with /r/ allophones as well as the synthesized words for the perception experiment.

Stone S, Birkholz P (2021). Articulation-to-speech using electro-optical stomatography and articulatory synthesis. ISSP 2021

In this study we controlled the articulatory speech synthesizer VocalTractLab with articulatory movements captured by electro-optical stomatography, i.e., with contact and distance sensors on an artificial palate that is worn in the mouth. This makes it possible to generate audible speech from silently produced speech movements. The supplemental material here provides the audio files for a range of speech sounds generated with the system.

Stone S, Azgin A, Mänz S, Birkholz P (2021). Prospects of articulatory text-to-speech synthesis. ISSP 2021

In this study we outline our architecture for an articulatory text-to-speech synthesis system based on VocalTractLab. With the audio files in this supplemental material we demonstrate a variety of prosodic modifications that can be applied to the generated utterances using simple rules. Please download the examples here.

Stone S, Schmidt P, Birkholz P (2020). Prediction of voicing and the f0 contour from electromagnetic articulography data for articulation-to-speech synthesis. ICASSP 2020

In this study we examined to what extent EMA data allow the prediction of voicing and f0 of the speech signal. Predicted f0 contours were imposed on the natural speech recordings made during the EMA measurements and used in a perception experiment. The stimuli of the experiment can be found here.

Birkholz P, Drechsel S, Stone S (2019). Perceptual optimization of an enhanced geometric vocal fold model for articulatory speech synthesis. Interspeech 2019

In this study we examined which control parameter settings of an enhanced geometric vocal fold model would generate the most natural-sounding male and female voices. The male samples (five German words) with the best parameter settings can be played here. The female samples with the best parameter settings can be played here.

Birkholz P, Gabriel F, Kürbis S, Echternach M (2019). How the peak glottal area affects linear predictive coding-based formant estimates of vowels. The Journal of the Acoustical Society of America, 146(1), pp. 223-232.

The supplemental material here contains the CAD files for 3D-printing the ten vocal tract resonators, the CAD files needed to create the silicone vocal fold model, and the audio files of the synthetic vowels generated with the physical models and the computer simulations.

Birkholz P, Stone S, Kürbis S (2019). Comparison of different methods for the voiced excitation of physical vocal tract models. In: Birkholz P, Stone S (eds.) Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2019 (TUDPress, Dresden)

The supplemental material here contains the CAD files for 3D-printing the eight vocal tract resonators for the eight long German vowels and the audio files of the stimuli generated using the different excitation methods.

Birkholz P, Pape D (submitted). How modeling entrance loss and flow separation in a two-mass model affects the oscillation and synthesis quality.

The supplemental material here contains the audio stimuli used for the perception experiments.

Stone S, Marxen M, Birkholz P (2018). Construction and Evaluation of a Parametric One-Dimensional Vocal Tract Model. IEEE/ACM Transactions on Audio, Speech and Language Processing, 26(8), pp. 1381-1392. doi: 10.1109/TASLP.2018.2825601

The supplemental material here contains the audio stimuli used for the perception experiments and an additional table with parameter values.

Klause F, Stone S, Birkholz P (2017). A head-mounted camera system for the measurement of lip protrusion and opening during speech production. In: Trouvain J, Steiner I, Möbius B (eds.) Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2017 (TUDPress, Dresden), pp. 145-151 [pdf]

This paper presents a head-mounted camera system for the simultaneous measurement of lip protrusion and opening during speech production. All necessary files for the construction of the camera helmet, and the software and scripts for the extraction of the parameters can be downloaded here.

Birkholz P, Martin L, Xu Y, Scherbaum S, Neuschaefer-Rube C (2016). Manipulation of the prosodic features of vocal tract length, nasality and articulatory precision using articulatory synthesis. Computer Speech & Language, 41, pp. 116-127

In this study we examined the articulatory generation of secondary prosodic features using articulatory speech synthesis. To this end, we manipulated certain articulatory features of re-synthesized German words and asked listeners to rate the prosodic effects. We applied the following articulatory manipulations:

Best possible resynthesis of the natural utterance (no manipulation)
Simulation of a longer vocal tract (larynx lowering and lip protrusion)
Simulation of a shorter vocal tract (larynx raising and lip retraction)
Nasalized articulation of all sonorants
Reduced articulatory effort (lower speed of target approximation)
Increased articulatory effort (higher speed of target approximation)
Slight centralization of all vowels and consonants (towards an indifferent articulation)
Strong centralization of all vowels and consonants

In the following list, you can listen to each of the nine words synthesized in each of the above variants (in the given order):

Here you can furthermore download the gestural scores and the speaker file used to synthesize the stimuli with VocalTractLab 2.1.

Birkholz P, Martin L, Willmes K, Kröger BJ, Neuschaefer-Rube C (2015). The contribution of phonation type to the perception of vocal emotions in German: an articulatory synthesis study. Journal of the Acoustical Society of America, 137(3), pp. 1503–1512

In this study we examined how changing the phonation type alone can change the perception of vocal emotions in re-synthesized portrayed emotional utterances, when other prosodic parameters (phone duration, pitch) remain the same as in the original utterance. Below are the stimuli for one of the examined sentences ("Der Lappen liegt auf dem Eisschrank"), which was originally spoken with seven emotional expressions. For each emotion, you will first hear the original utterance re-synthesized as well as possible with regard to phonation type, phone duration, and pitch. Then you will hear the same sentence with the phonation type replaced by purely breathy voice, purely modal voice, and purely pressed voice.

The whole supplemental material for the paper (also including the screenshot in Fig. 2) can be downloaded here.

Mumtaz R, Preuß S, Neuschaefer-Rube C, Hey C, Sader R, Birkholz P (2014). Tongue Contour Reconstruction from Optical and Electrical Palatography. IEEE Signal Processing Letters, 21(6), pp. 658-662

In this study we examined the potential of optical and electrical palatography (OPG and EPG) for the reconstruction of the tongue contour. To this end, we extracted the tongue contour and the corresponding (virtual) EPG and OPG data from MRI corpora of the vocal tract of two different speakers, and trained linear models to predict the tongue contours based on the sensor data.
This supplemental material contains the tongue and vocal tract contours extracted from the MRI samples as svg files, the tongue points and simulated sensor data in Excel tables, and Matlab scripts for the cross-validation and EPG index calculation.

Birkholz P, Kröger BJ, Neuschaefer-Rube C (2011). Articulatory synthesis of words in six voice qualities using a modified two-mass model of the vocal folds. In: Proc. of the 1st International Workshop on Performative Speech and Singing Synthesis, Vancouver, BC [pdf]

In this study we examined the capability of our extended two-mass model of the vocal folds to synthesize words in different voice qualities. The following stimuli were synthesized and judged by listeners with respect to the perceived voice quality:

Birkholz P, Kröger BJ, Neuschaefer-Rube C (2010). Stimmsynthese mit einem Zwei-Massen-Modell der Stimmlippen mit dreieckigem Öffnungsquerschnitt. In 27th Jahrestagung der DGPP, Aachen, Germany [pdf]

Synthetic vowel stimuli for several steps on the continuum from the minimal to the maximal glottal rest area using the classical two-mass model of the vocal folds (wav). All stimuli were perceived as having a slightly pressed to normal voice quality.
Synthetic vowel stimuli for several steps on the continuum from the minimal to the maximal glottal rest area using the new two-mass model of the vocal folds with a triangular glottis shape in the rest position (wav). As in reality, the voice quality changes from pressed through normal to breathy as the glottal rest area increases.