Introduction
Since 1996 it has been demonstrated that visual inspection of raw complex waveforms can be used to identify the vowel produced by a talker (Stokes, 1996). This research resulted in the MAS Model of Vowel Perception and Production (Stokes, 1998). More recently, a further identification experiment extended visual vowel identification from waveform displays to female as well as male talkers (Stokes, 2001). Together, this work represents the only ongoing comprehensive research involving visual inspection of raw complex waveforms, and it continues to provide innovations that are supported by experimental evidence. The innovations include a new way to categorize and organize the vowel space, a direct link between articulatory gestures and formant frequency, and an explanation for 98.6% of perceptual errors.
The work above deals with identifying any vowel produced by any talker. In contrast, the present study involves identifying the specific talker who produced a vowel. From visual displays of raw complex waveforms derived from small samples of speech, a talker can be reliably identified. In the study described below, unique voice signatures seen in waveform displays are used to identify a talker from a set of 10 talkers, in the same way as one would identify a person from fingerprints.
Objective
Biometrics are measurable physiological and/or behavioral characteristics that can be used to identify an individual. These include fingerprints, retinal and iris scans, facial recognition, voice patterns, and several other techniques. Each person has a unique quality to their speech pattern because of physical characteristics such as vocal tract length, vocal folds, shape of the oral cavity, and the shape of the articulators (teeth, tongue, and lips). Each talker is physically able to produce the acoustic patterns that listeners perceive as one vowel or another (see MAS Model), but these patterns retain individual voice signature characteristics. This is directly analogous to fingerprints, where each finger has a unique physical pattern of ridges and pores that give it a signature pattern unique to that person and no one else.
Prior to the development of the MAS Model and the computer advancements that allowed waveforms to be analyzed more quickly and in greater detail, working with waveforms was difficult and yielded limited results. Both the work associated with the MAS Model and the limited earlier work with complex waveforms focused on recognizing vowel patterns, so attempting to identify a person from a waveform would have been premature. However, with recent computer advancements and the organization of the vowel space established by the MAS Model, the work with waveforms can now be extended to talker identification.
The visual patterns within a pitch period of a given vowel result from the combination of F0, F1, F2, and the higher formants, which together create a raw complex waveform as unique to each person as fingerprints. The degree of similarity in these acoustic attributes that would be needed for two talkers to be confused is mathematically on a scale comparable to that of fingerprints. The goal of the present study is to illustrate that these unique visual waveform patterns can reliably be used to identify an individual from small samples of speech. The study below involves taking a sample of speech from a talker and attempting to identify that person from a group of 10 talkers simply by matching the patterns within the raw complex waveforms.
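To make the idea of a pitch period built from F0 and the formants more concrete, the following is a minimal sketch that synthesizes a crude vowel-like waveform as a sum of damped sinusoids restarted at every pitch period. It is a toy approximation, not the MAS Model and not the software used in the study; the function name, the sampling rate, the damping factor, and the frequency values are all assumptions chosen for illustration.

```python
import math

def synth_pitch_periods(f0, formants, sample_rate=16000, n_periods=3, decay=0.15):
    """Return samples of a crude vowel-like waveform (illustrative only).

    Each pitch period restarts a sum of exponentially damped sinusoids,
    one per formant frequency. The damping rate is an assumption and is
    simply made proportional to the formant frequency.
    """
    period_len = int(round(sample_rate / f0))  # samples in one pitch period
    samples = []
    for _ in range(n_periods):
        for n in range(period_len):
            t = n / sample_rate  # time since the start of this pitch period
            value = sum(
                math.exp(-decay * 2 * math.pi * f * t) * math.sin(2 * math.pi * f * t)
                for f in formants
            )
            samples.append(value)
    return samples

# Hypothetical F0 and formant values, not measurements from the study
wave = synth_pitch_periods(f0=125.0, formants=[500.0, 1500.0, 2500.0])
```

Changing the formant list or F0 changes the shape that repeats within each pitch period, which is the kind of pattern the visual comparison relies on.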
Method
One subject (MS) participated in two talker identification trials. The trials were performed on a Compaq Presario 2250 using the TFR software package to view the speech files on a 15-inch MV500 monitor. The speech productions used in the trials were randomly selected from a 100-talker (50 male and 50 female) database of hVd productions (had, heed, hod, etc.) collected and maintained by Dr. John Mullennix (University of Pittsburgh at Johnstown). Three tokens of each hVd word from each talker were recorded using the CSRE software package. The talkers were all Midwestern speakers.
Each trial involved one production of the same word from each of 10 male talkers (10 test words), and one additional token of that word produced by one of the 10 talkers (1 match word). Trial 1 used eleven productions of the word /heed/ (10 test words + 1 match word = 11), and Trial 2 used eleven productions of the word /who'd/. The match word and the identifying test word are different productions of the same word by one talker. The task was for MS to choose which of the 10 test words was produced by the talker who produced the match word.
MS selected a 15 to 20 msec sample from the middle of each vowel and enlarged several pitch periods for detailed visual analysis of the waveform voice signature characteristics. Several test words were quickly rejected as possible matches, and the remaining possibilities were further reviewed before an identification was made. Dr. Mullennix verified which test word was produced by the match word talker after MS completed both trials. No audio cues were available to MS during testing. Visual inspection and comparison of the voice signatures seen from the displays of the raw complex waveforms were the only means used for talker identification. Figures 1 through 4 illustrate the approximately 20 msec displays MS used during the trials with Figures 1 and 2 being the matching samples.
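The sample selection step can be sketched in code. The snippet below is only an illustration of taking a roughly 20 msec window centred on the middle of a vowel; in the study this was done by hand in the TFR display. The function name `middle_window`, the 16 kHz sampling rate, and the placeholder signal are assumptions, since the source does not state the sampling rate of the recordings.

```python
def middle_window(samples, sample_rate, window_ms=20.0):
    """Return the slice of `samples` centred on the midpoint, `window_ms` long."""
    n_window = int(round(sample_rate * window_ms / 1000.0))
    n_window = min(n_window, len(samples))  # never ask for more than exists
    start = (len(samples) - n_window) // 2
    return samples[start:start + n_window]

# Placeholder "vowel": 250 ms of sample indices at an assumed 16 kHz rate
vowel = list(range(4000))
segment = middle_window(vowel, sample_rate=16000, window_ms=20.0)
```

With these assumed values the window holds 320 samples, which would cover two to three pitch periods for a typical male F0.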
Figure 1 - Match word /heed/ (Talker 1, token 1)

Figure 2 - Identifying test word /heed/ (Talker 1, token 2)

Figure 3 - Test word /heed/ (Talker 2)

Figure 4 - Test word /heed/ (Talker 3)

Results
In Trial 1, MS correctly identified the talker of the match word as the talker who produced token #8 of the 10 test words. In Trial 2, MS correctly identified the talker of the match word as the talker who produced token #3 of the 10 test words. In both trials, MS correctly identified the talker in question from the set of the 10 test words.
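As a rough point of reference for these results, the chance of selecting the correct talker in both trials by guessing alone can be worked out. The sketch below assumes each guess is an independent, uniformly random choice among the 10 test words; that assumption, and the function name, are introduced here and are not part of the original report.

```python
from fractions import Fraction

def chance_all_correct(n_candidates, n_trials):
    """Probability of picking the right talker in every trial by pure guessing."""
    return Fraction(1, n_candidates) ** n_trials

p = chance_all_correct(n_candidates=10, n_trials=2)
print(p, float(p))  # 1/100 0.01
```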
The 100% accuracy (two of two trials) for identifying the talker from the words /heed/ and /who'd/ illustrates the robustness of this technique, specifically because the vowels in these two words have the fewest visual cues per pitch period when compared to the other vowels in English. This is because these two vowels have the fewest F1 cycles per pitch period, making them the least complex vowels in the vowel space. As the number of F1 cycles per pitch period increases, the number of reference points and identifiable features within a pitch period increases. In other words, one F1 cycle per pitch period does not provide the same number of potential visual cues as a vowel with 4 or 5 cycles per pitch period. Therefore, talker identification would be expected to be extremely reliable and efficient when it is based on vowels other than those produced in the words /heed/ and /who'd/.
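The number of F1 cycles that fit in one pitch period is simply the ratio F1/F0, since a pitch period lasts 1/F0 seconds. The short sketch below computes that ratio; the function name and the frequency values are hypothetical and are not measurements from the talkers in the study.

```python
def f1_cycles_per_pitch_period(f1_hz, f0_hz):
    """Number of F1 cycles completed during one pitch period (F1 / F0)."""
    return f1_hz / f0_hz

# Hypothetical values for a talker with F0 = 125 Hz
low_f1_vowel = f1_cycles_per_pitch_period(f1_hz=250.0, f0_hz=125.0)   # 2.0
high_f1_vowel = f1_cycles_per_pitch_period(f1_hz=625.0, f0_hz=125.0)  # 5.0
```

Under these assumed values the vowel with the higher F1 shows more cycles, and therefore more potential reference points, within each pitch period.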
Discussion
Talker identification from waveform displays adds to the reliable biometrics available for individual identification, and would allow voice signatures to be used on a larger scale for talker identification and as a more reliable tool for talker verification. Voice verification has had some success, but is not as reliable as some other measures because of variability in transducers and in the environment where the speech is produced (background noise, etc.). The method presented here is comparable to fingerprints, in that fingerprints are rarely without flaws and are left in a variety of environments (glass, wood, etc.). Further work with waveforms produced under less than perfect conditions will need to take place, but there are applications that could already utilize this method of identification. With analysis of a variety of speech produced in a variety of conditions, waveforms should lead to a level of identification success comparable to that achieved by fingerprints.
Waveform identification will require that the points used for identification be standardized. This would involve determining points within each vowel that can be used for identification and training. This report documents that talker identification can be done, in much the same way that Stokes (1996) reported success with identifying vowels. Furthermore, these techniques apply across languages, since the method for producing vowels is the same across languages. In other words, the techniques used with English will work with Arabic or any other language.
References
Stokes, M.A. (2001), "Male and female vowels identified by visual inspection of raw complex waveforms," presented at the 141st meeting: Acoustical Society of America.
Stokes, M.A. (1998), "MAS Model of Vowel Perception and Production," posted on the internet at: http://home.indy.net/~masmodel/
Stokes, M.A. (1996), "Identification of vowels based on visual cues within raw complex waveforms," presented at the 131st meeting: Acoustical Society of America.