Application of the MUSIC method for estimation of the signal fundamental frequency

The goal of this paper is to use the MUSIC method to estimate the fundamental frequency of signals and to compare the results with those obtained by the DFT method. Real speech signals are considered: the fundamental frequency of Lithuanian vowel sounds is estimated by both the MUSIC and the DFT method.


Introduction
Estimation of the fundamental frequency is very important in many fields of speech signal processing, such as speech coding, speech synthesis, and speech and speaker recognition [1,9]. The speech signal fundamental frequency is an essential feature of the human voice [2]. What we hear as a single sound when someone is speaking (for example, pronouncing /a/) is really the fundamental frequency plus a series of harmonics. The fundamental frequency is determined by the number of times the vocal folds vibrate in one second and is measured in cycles per second [cps], or Hertz [Hz]. The harmonics are multiples of the fundamental frequency: if the fundamental frequency is 100 Hz, the harmonics are 200 Hz, 300 Hz, 400 Hz, etc. Mathematically, if y(t) is a sound signal, we can use the following model:

    y(t) = \sum_{k=1}^{p} a_k \sin(2\pi f_k t + \varphi_k) + e(t),    (1)

where a_k ∈ R, ϕ_k ∈ [−π, π], and {e(t)} is white Gaussian noise; the lowest frequency f_0 = f_1 is called the fundamental frequency, and the other frequencies f_k = k f_0 (k = 2, . . . , p) are called harmonics. The fundamental frequency is also called the first harmonic. We normally do not hear the harmonics as separate tones; they do, however, exist in the sound and add a lot of richness to it. Without them a voice would sound uninteresting and synthetic [16]. Often the sinusoid of frequency f_k = k f_0 is itself called the kth harmonic of the signal y(t).
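For illustration, model (1) is easy to simulate directly. The sketch below (Python/NumPy; all parameter values are arbitrary choices for the illustration, not data from the paper) synthesizes a signal with fundamental frequency 100 Hz and four harmonics:

```python
import numpy as np

# Illustrative simulation of model (1): a sum of p harmonically related
# sinusoids plus white Gaussian noise. All parameter values are made up.
fs = 48_000                      # sampling frequency, Hz
f0 = 100.0                       # fundamental frequency, Hz
p = 4                            # number of harmonics in the sum
t = np.arange(1024) / fs         # a record of 1024 samples

rng = np.random.default_rng(0)
a = rng.uniform(0.2, 1.0, p)             # amplitudes a_k
phi = rng.uniform(-np.pi, np.pi, p)      # phases phi_k in [-pi, pi]
e = 0.01 * rng.standard_normal(t.size)   # white Gaussian noise e(t)

# y(t) = sum_k a_k sin(2*pi*k*f0*t + phi_k) + e(t), with f_k = k*f0
y = sum(a[k] * np.sin(2 * np.pi * (k + 1) * f0 * t + phi[k])
        for k in range(p)) + e
```

The DFT magnitude of such a signal shows peaks near 100, 200, 300 and 400 Hz, i.e., at the fundamental frequency and its harmonics.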
Much effort is devoted in Lithuania to developing digital technologies for Lithuanian speech processing [3-7,11]. Lithuanian speech synthesis is one of the tasks of Lithuanian speech digital processing. In order to solve the problem of Lithuanian speech synthesis, it is necessary to develop mathematical models of Lithuanian speech sounds. Development of vowel models is a part of this problem. One of the main vowel models is a model of harmonically related sinusoids, whose main parameter is the fundamental frequency. In order to obtain a synthesized sound of good quality, one needs to estimate this frequency as accurately as possible. The DFT method is usually used for this purpose. It gives good results when the observed signal is sufficiently long; for shorter signals, its performance is not satisfactory, and alternative methods have to be used. One such method is the so-called MUSIC method, which is widely used in the field of mobile communications. In [10], T. Murakami and Y. Ishida applied the MUSIC method to the analysis of speech signals. They used it for fundamental frequency estimation of the Japanese female and male vowels /a/, /e/, /i/, /o/, /u/ and showed that their MUSIC-based method is superior to the conventional cepstral method for estimating the fundamental frequency.
The goal of this paper is to apply the MUSIC method to the estimation of the fundamental frequency of the main Lithuanian vowels. The paper is organized as follows. The MUSIC algorithm is reviewed in Section 2. A comparison of the results obtained by the MUSIC method and by the conventional DFT method is presented in Section 3. Section 4 contains the conclusions.

MUSIC method
Consider the following model:

    y_n = \sum_{l=1}^{p} c_l e^{j\omega_l n} + e_n,    n = 1, . . . , N,    (2)

where c_l ∈ C and {e_n} is white noise. Let M be some integer greater than p. Define

    x(t) = [c_1 e^{j\omega_1 t}, . . . , c_p e^{j\omega_p t}]^T,
    y(t) = [y_t, y_{t+1}, . . . , y_{t+M-1}]^T,
    e(t) = [e_t, e_{t+1}, . . . , e_{t+M-1}]^T,    (3)

where t = 1, . . . , N − M + 1. Define also

    a(\omega) = [1, e^{j\omega}, . . . , e^{j(M-1)\omega}]^T,    A = [a(\omega_1), . . . , a(\omega_p)].    (4)

We can now write (2) as

    y(t) = A x(t) + e(t).    (5)

The MUSIC method [12,13,15] was developed in 1979 by the American scientist R. Schmidt. The acronym MUSIC stands for MUltiple SIgnal Classification. The method deals with estimation of the parameters of model (5).
The covariance matrix R = E y(t) y^H(t) of the vector y(t) is given by [14]

    R = A P A^H + \sigma^2 I_{M×M},    (6)

where σ^2 is the noise variance defined by E e(t) e^H(t) = σ^2 I_{M×M}, and P = E x(t) x^H(t).
Denote by λ_1 ≥ λ_2 ≥ . . . ≥ λ_M the eigenvalues of the matrix R. Since rank(A P A^H) = p [14],

    λ_k > σ^2 for k = 1, . . . , p,    λ_k = σ^2 for k = p + 1, . . . , M.    (7)

Let s_1, s_2, . . . , s_p be the unit-norm eigenvectors corresponding to the p largest eigenvalues λ_1, λ_2, . . . , λ_p, and g_1, g_2, . . . , g_{M−p} the unit-norm eigenvectors corresponding to the remaining M − p eigenvalues; set G = [g_1, . . . , g_{M−p}]. It is shown in [14] that the true parameter values {ω_1, . . . , ω_p} are the only solutions of the equation

    a^H(ω) G G^H a(ω) = 0,    (8)

i.e., the locations of the peaks of the function

    P_MU(e^{jω}) = 1 / (a^H(ω) G G^H a(ω)).    (9)

In practice, G is unknown, and we use the estimate

    P̂_MU(e^{jω}) = 1 / (a^H(ω) Ĝ Ĝ^H a(ω)),    (10)

where Ĝ is formed from the eigenvectors corresponding to the M − p smallest eigenvalues of the sample covariance matrix R̂ = (N − M + 1)^{-1} \sum_{t=1}^{N−M+1} y(t) y^H(t).
The estimates of {ω_1, . . . , ω_p} are obtained as the locations of the p largest peaks of P̂_MU(e^{jω}); in practice, this is done by evaluating P̂_MU at the points of a fine frequency grid.
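The procedure above (sample covariance R̂, eigendecomposition, noise-subspace matrix Ĝ, evaluation of P̂_MU on a grid) can be sketched in Python/NumPy. This is our own illustrative re-implementation, with made-up function name and test parameters, not the programs used for the experiments:

```python
import numpy as np

def music_pseudospectrum(y, p, M, freqs, fs):
    """Evaluate the MUSIC pseudospectrum of (10) on a frequency grid.

    y: signal samples y_1..y_N; p: assumed number of complex sinusoids;
    M: embedding dimension (M > p); freqs: grid in Hz; fs: sampling rate, Hz.
    """
    y = np.asarray(y)
    N = len(y)
    # Sample covariance R^ = (N-M+1)^{-1} sum_t y(t) y(t)^H,
    # with y(t) = [y_t, ..., y_{t+M-1}]^T
    Y = np.array([y[t:t + M] for t in range(N - M + 1)]).T   # M x (N-M+1)
    R = Y @ Y.conj().T / (N - M + 1)

    # Eigendecomposition: the eigenvectors of the M - p smallest
    # eigenvalues span the estimated noise subspace G^
    _, V = np.linalg.eigh(R)             # eigenvalues in ascending order
    G = V[:, :M - p]

    # Steering vectors a(w) = [1, e^{jw}, ..., e^{j(M-1)w}]^T on the grid
    w = 2 * np.pi * np.asarray(freqs) / fs
    A = np.exp(1j * np.outer(np.arange(M), w))               # M x n_freqs

    # P^_MU = 1 / (a^H G^ G^H a); a small denominator gives a sharp peak
    return 1.0 / np.sum(np.abs(G.conj().T @ A) ** 2, axis=0)
```

For a noisy complex exponential, the grid point with the largest pseudospectrum value lands very close to the true frequency even for a short record. Note that a real-valued sinusoid contributes two complex exponentials (at ±f), so p must count both.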

Fundamental frequency estimation using the MUSIC method
We consider real data in this section: samples of natural sounds. The sounds were recorded using a microphone and the "Sound Record" program. The sound recording parameters were as follows: the sampling frequency equal to 48 kHz and the signal quantization accuracy equal to 16 bits. This sampling frequency corresponds to a sampling interval of about 21 µs. The experiments were carried out using our programs developed in MATLAB. We applied the MUSIC method and the DFT (Discrete Fourier Transform) method to the Lithuanian female vowels /a/, /i/, /o/, /u/. The MUSIC spectrum was calculated using the formula

    S_MU(f) = P̂_MU(e^{j2πf}),

where P̂_MU(e^{j2πf}) is defined by (10) and f denotes frequency normalized by the sampling frequency. For each of the vowels mentioned above, we considered 80 records of length 1024 points. For each of these records, we calculated the spectra and the estimates of the fundamental frequency by the MUSIC and DFT methods, and obtained their mean E(f_0) and standard deviation σ(f_0). The results are shown in Fig. 1 and Table 1.
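The paper does not spell out how the DFT estimate of f_0 is formed; a common choice, used here only as an illustrative sketch, is to take the frequency of the largest DFT magnitude bin inside a plausible f_0 range (the range limits below are our assumption):

```python
import numpy as np

def f0_dft(y, fs, fmin=60.0, fmax=400.0):
    """Sketch of a DFT-based f0 estimate: frequency of the largest
    magnitude bin inside [fmin, fmax] (range limits are assumptions)."""
    N = len(y)
    mag = np.abs(np.fft.rfft(y))            # magnitude spectrum
    freqs = np.arange(mag.size) * fs / N    # bin center frequencies, Hz
    band = (freqs >= fmin) & (freqs <= fmax)
    return freqs[band][np.argmax(mag[band])]
```

With records of N = 1024 points at 48 kHz the DFT bins are fs/N ≈ 46.9 Hz apart, so such an estimate is inherently coarse; MUSIC, which evaluates P̂_MU on an arbitrarily fine grid, does not share this limitation.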
We see from Table 1 that the vowel /i/ has the highest fundamental frequency and the vowel /a/ the second highest; this can be observed in the estimation results of both methods. The lowest fundamental frequency is in the vowel /u/ (the MUSIC result) or in the vowel /o/ (the DFT result). The difference between the estimated frequencies of the vowels /o/ and /u/, however, is small, about 1 Hz (0.85 Hz for MUSIC and 1.17 Hz for the DFT). The smallest standard deviation, 2.36 Hz, was obtained by the MUSIC method for the vowel /i/, and the largest, 5.59 Hz, by the DFT method for the vowel /a/. It is easy to notice that the DFT standard deviation values are higher than those of the MUSIC method for all vowels. Since the DFT and MUSIC methods give different estimates of the fundamental frequency, we have to check which estimate describes the real situation more accurately. For each vowel, we used a model of the sum of ten harmonics, where the first harmonic frequency was the fundamental frequency estimate (the DFT or the MUSIC one). The parameters of the harmonics were estimated by the usual linear least-squares method. The relative estimation errors are shown in Table 2.
We see from Table 2 that the errors obtained with the MUSIC fundamental frequency estimate are smaller than those obtained with the DFT estimate. The fact that the errors are rather large can be attributed to the complexity of real sound signals: their harmonics are time-varying, and it is very difficult to describe such a signal with a time-invariant frequency model.
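The least-squares fit of the harmonic parameters becomes an ordinary linear problem once each harmonic a_k sin(2π k f_0 t + ϕ_k) is expanded as b_k sin(2π k f_0 t) + c_k cos(2π k f_0 t). A sketch (our own illustrative code; the function name and test values are made up):

```python
import numpy as np

def harmonic_ls_fit(y, fs, f0, n_harm=10):
    """Fit a sum of n_harm harmonics of f0 to y by linear least squares.

    Each harmonic a_k sin(2*pi*k*f0*t + phi_k) is written as
    b_k sin(2*pi*k*f0*t) + c_k cos(2*pi*k*f0*t), linear in (b_k, c_k).
    Returns the fitted signal and the relative error ||y - y_hat|| / ||y||.
    """
    t = np.arange(len(y)) / fs
    cols = []
    for k in range(1, n_harm + 1):
        cols.append(np.sin(2 * np.pi * k * f0 * t))
        cols.append(np.cos(2 * np.pi * k * f0 * t))
    X = np.column_stack(cols)                       # design matrix
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares solution
    y_hat = X @ coef
    return y_hat, np.linalg.norm(y - y_hat) / np.linalg.norm(y)
```

The amplitude and phase of the kth harmonic are recovered as a_k = sqrt(b_k^2 + c_k^2) and ϕ_k = atan2(c_k, b_k). An inaccurate f_0 estimate makes the basis drift against the signal over the record, which is one way to see why a better f_0 estimate gives a smaller relative error.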
The true signal of the vowel /u/ and its estimates obtained by the DFT and MUSIC methods are shown in Fig. 3. The signal estimates are obtained using a model of the sum of 10 harmonics. One can see that the MUSIC estimate almost coincides with the true signal.

Conclusions
Estimation of the fundamental frequency is very important for the synthesis of Lithuanian vowels, since this frequency is the main parameter of the vowel models. It determines the frequency at which impulses must be fed to the input of the forming filter of the synthesizer.
The model (1) can also be used in vowel recognition. A vector composed of the fundamental frequency and the amplitudes of the first 10-20 harmonics can be taken as the recognition vector. The harmonic amplitudes are obtained by a simple least-squares fit.
Our investigation has shown that the estimates of the fundamental frequency obtained by the MUSIC method are less scattered around their average than those obtained by the DFT method.
Approximation of a real sound signal by a sum of the first 10 harmonics, with the fundamental frequency obtained by the MUSIC and DFT methods, gave a smaller error in the case of the MUSIC estimate.