
INTRODUCTION

Bregman reported that the human auditory system uses four psychoacoustically heuristic regularities related to acoustic events to solve the problem of Auditory Scene Analysis (ASA) [1]. If a segregation model were constructed using constraints related to these heuristic regularities, it would be applicable not only as a preprocessor for robust speech recognition systems but also to various types of signal processing.

Some ASA-based segregation models already exist. There are two main types of models, based on either bottom-up [2] or top-down processes [3,6]. All of these models use some of the four regularities and take the amplitude (or power) spectrum as the acoustic feature. Consequently, they cannot completely extract the desired signal from a noisy signal when the signal and noise occupy the same frequency region.

In contrast, we have discussed the need to use not only the amplitude spectrum but also the phase spectrum in order to completely extract the desired signal from a noisy signal, addressing the problem of segregating two acoustic sources [8]. This problem is defined as follows [8]. First, only the mixed signal f(t), where f(t)=f₁(t)+f₂(t), can be observed. Next, f(t) is decomposed into its frequency components by a filterbank (the number of channels is K). The output of the k-th channel X_k(t) is represented by

$$X_k(t)=S_k(t)\exp(j\omega_k t + j\phi_k(t)). \qquad (1)$$

Here, if the components of the k-th channel output that correspond to f₁(t) and f₂(t) are assumed to be $A_k(t)\exp(j\omega_k t + j\theta_{1k}(t))$ and $B_k(t)\exp(j\omega_k t + j\theta_{2k}(t))$, then the instantaneous amplitudes of the two signals, A_k(t) and B_k(t), can be determined by

$$A_k(t) = \frac{S_k(t)\sin(\theta_{2k}(t)-\phi_k(t))}{\sin\theta_k(t)}, \qquad (2)$$
$$B_k(t) = \frac{S_k(t)\sin(\phi_k(t)-\theta_{1k}(t))}{\sin\theta_k(t)}, \qquad (3)$$

where $\theta_k(t)=\theta_{2k}(t)-\theta_{1k}(t)$, $\theta_k(t)\neq n\pi$, $n\in\mathbf{Z}$, and $\omega_k$ is the center frequency of the k-th channel. Here, $\theta_{1k}(t)$ and $\theta_{2k}(t)$ are the instantaneous input phases of f₁(t) and f₂(t), respectively. Finally, f₁(t) and f₂(t) can be reconstructed by using the determined $[A_k(t), \theta_{1k}(t)]$ and $[B_k(t), \theta_{2k}(t)]$ for all channels.
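As a quick numerical check of Eqs. (2) and (3), the following minimal sketch builds one channel output at a single time instant from two components with known amplitudes and phases, then recovers the amplitudes from $S_k$, $\phi_k$, and the two input phases. The helper name `segregate_amplitudes` and all numeric values are illustrative assumptions, not part of the original method. The carrier $\exp(j\omega_k t)$ is omitted because it is common to every term and cancels.

```python
import cmath
import math


def segregate_amplitudes(s_k, phi_k, theta_1k, theta_2k):
    """Recover (A_k, B_k) at one time instant via Eqs. (2) and (3).

    Hypothetical helper for illustration; all phases are in radians.
    """
    theta_k = theta_2k - theta_1k
    sin_theta_k = math.sin(theta_k)
    # Eqs. (2)-(3) are undefined when theta_k is a multiple of pi.
    if abs(sin_theta_k) < 1e-12:
        raise ValueError("theta_k must not be a multiple of pi")
    a_k = s_k * math.sin(theta_2k - phi_k) / sin_theta_k
    b_k = s_k * math.sin(phi_k - theta_1k) / sin_theta_k
    return a_k, b_k


# Assumed ground-truth values for the two components (illustrative only).
A_TRUE, B_TRUE = 0.8, 0.3
THETA_1, THETA_2 = 0.2, 1.1

# Mixed channel output with the common carrier exp(j*omega_k*t) factored out.
mixed = A_TRUE * cmath.exp(1j * THETA_1) + B_TRUE * cmath.exp(1j * THETA_2)
S_K, PHI_K = abs(mixed), cmath.phase(mixed)

a_est, b_est = segregate_amplitudes(S_K, PHI_K, THETA_1, THETA_2)
print(a_est, b_est)
```

Under these assumptions the recovered values agree with the chosen amplitudes up to floating-point rounding. The check also shows that the amplitudes are fully determined once both instantaneous phases are known, so the phases are the remaining unknowns.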

This problem is an ill-posed inverse problem because there are currently no equations for determining the two instantaneous phases. Therefore, we have proposed a method of solving this problem using constraints related to the four regularities [8]. Although this method could extract a synthesized vowel from a noisy synthesized vowel with high accuracy, it assumed that the fundamental frequency was constant and known, and that $\theta_{1k}(t)=0$. Here, $\theta_{1k}(t)=0$ means that the frequency of the signal component passing through each channel coincides with the center frequency of that channel. It is therefore difficult to extract real speech from noisy speech using this method, because the fundamental frequency of speech fluctuates and its multiples do not, in general, coincide with the center frequencies of the channels.
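To illustrate why $\theta_{1k}(t)=0$ is restrictive, note that a sinusoidal component at frequency $\omega_k+\Delta\omega$ can be written as $A\exp(j\omega_k t + j\Delta\omega t)$, so its phase relative to the channel carrier is $\theta_{1k}(t)=\Delta\omega t$ (taking zero initial phase). The sketch below evaluates this drift. The center frequency, fundamental frequency, and harmonic number are made-up values chosen only for illustration.

```python
import math


def relative_phase(component_hz, center_hz, t):
    """Phase in radians of a sinusoid relative to the channel carrier.

    Hypothetical helper; assumes zero initial phase and a fixed frequency.
    """
    return 2.0 * math.pi * (component_hz - center_hz) * t


# Assumed values (not from the paper): one channel and one harmonic.
CENTER_HZ = 600.0   # channel center frequency
F0_HZ = 125.0       # fundamental frequency
HARMONIC = 5        # the 5th harmonic lies at 625 Hz, 25 Hz off center

T_SEC = 0.01
drift = relative_phase(HARMONIC * F0_HZ, CENTER_HZ, T_SEC)
on_center = relative_phase(CENTER_HZ, CENTER_HZ, T_SEC)
print(drift, on_center)
```

With these assumed values, a 25 Hz offset yields a relative phase of about a quarter cycle after 10 ms, whereas a component exactly at the center frequency keeps a relative phase of zero.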

This paper proposes a new method for extracting real speech from noisy speech by (1) incorporating a method of estimating the fundamental frequency and (2) reconsidering the constraint on $\theta_{1k}(t)$.


Masashi Unoki
2000-10-26