Next: AUDITORY SOUND SEGREGATION MODEL Up: Segregation of vowel in Previous: ABSTRACT

INTRODUCTION

Bregman has reported that the human auditory system exploits four psychoacoustic heuristic regularities of acoustic events to solve the problem of auditory scene analysis (ASA) [1]. If an auditory sound segregation model were constructed using these regularities, it would be applicable not only as a preprocessor for robust speech recognition systems but also to various other types of signal processing.

Several ASA-based segregation models already exist, falling into two main types: those based on bottom-up processes [2] and those based on top-down processes [3,4]. All of these models use some of the four regularities and take the amplitude (or power) spectrum as the acoustic feature. They therefore cannot completely segregate the desired signal from a noisy signal when signal and noise occupy the same frequency region.

In contrast, we have argued that not only the amplitude spectrum but also the phase spectrum must be used in order to completely extract the desired signal from a noisy signal, and we have addressed the problem of segregating two acoustic sources [5]. This problem is defined as follows [5,7]. First, only the mixed signal f(t), where f(t)=f1(t)+f2(t), can be observed. Next, f(t) is decomposed into its frequency components by a K-channel filterbank. The output of the k-th channel, Xk(t), is represented by

\begin{displaymath}
X_k(t)=S_k(t)\exp(j\omega_k t + j\phi_k(t)). \qquad (1)
\end{displaymath}
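As a minimal numerical sketch of Eq. (1), a single channel output can be approximated by band-limiting the mixture and taking its analytic signal via an FFT-based Hilbert transform; the actual model uses an auditory filterbank, which is not reproduced here, and the 1 kHz center frequency and component values below are purely illustrative.

```python
import numpy as np

def analytic(x):
    # analytic signal via the FFT-based Hilbert transform (even-length input)
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    h[1:N // 2] = 2.0
    h[N // 2] = 1.0
    return np.fft.ifft(X * h)

fs = 8000
t = np.arange(0, 0.1, 1.0 / fs)          # 800 samples
wk = 2 * np.pi * 1000                    # channel center frequency (assumed)
# two same-band components observed only as their mixture f(t) = f1(t) + f2(t)
f = 0.8 * np.cos(wk * t) + 0.5 * np.cos(wk * t + 1.0)

z = analytic(f)                          # X_k(t) = S_k(t) exp(j(wk t + phi_k(t)))
S = np.abs(z)                            # instantaneous amplitude S_k(t)
phi = np.unwrap(np.angle(z)) - wk * t    # instantaneous phase phi_k(t)
```

Away from the edge artifacts of the Hilbert transform, S and phi match the amplitude and phase of the complex sum 0.8 + 0.5 e^{j1.0}.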

Here, if the outputs of the k-th channel corresponding to f1(t) and f2(t) are assumed to be $A_k(t)\exp(j\omega_k t + j\theta_{1k}(t))$ and $B_k(t)\exp(j\omega_k t + j\theta_{2k}(t))$, then the instantaneous amplitudes Ak(t) and Bk(t) can be determined by
  
$A_k(t) = S_k(t)\sin(\theta_{2k}(t)-\phi_k(t))/\sin\theta_k(t)$, (2)
$B_k(t) = S_k(t)\sin(\phi_k(t)-\theta_{1k}(t))/\sin\theta_k(t)$, (3)

where $\theta_k(t)=\theta_{2k}(t)-\theta_{1k}(t)$, $\theta_k(t)\not= n\pi, n\in{\bf {Z}}$, and $\omega_k$ is the center frequency of the k-th channel. Instantaneous phases $\theta_{1k}(t)$ and $\theta_{2k}(t)$ can be determined by
$\theta_{1k}(t) = -\arctan\left(\frac{Y_k(t)\cos\phi_k(t)-\sin\phi_k(t)}{Y_k(t)\sin\phi_k(t)+\cos\phi_k(t)}\right)+\arcsin\left(\frac{A_k(t)Y_k(t)}{S_k(t)\sqrt{Y_k(t)^2+1}}\right)$, (4)
$\theta_{2k}(t) = -\arctan\left(\frac{Y_k(t)\cos\phi_k(t)+\sin\phi_k(t)}{Y_k(t)\sin\phi_k(t)-\cos\phi_k(t)}\right)+\arcsin\left(-\frac{B_k(t)Y_k(t)}{S_k(t)\sqrt{Y_k(t)^2+1}}\right)$, (5)

where $Y_k(t)={\sqrt{(2A_k(t)B_k(t))^2-Z_k(t)^2}}/{Z_k(t)}$ and $Z_k(t)=S_k(t)^2-A_k(t)^2-B_k(t)^2$. Hence, f1(t) and f2(t) can be reconstructed by using the determined pairs $[A_k(t), \theta_{1k}(t)]$ and $[B_k(t), \theta_{2k}(t)]$ for all channels. However, as is easily seen from the above equations, $A_k(t)$, $B_k(t)$, $\theta_{1k}(t)$, and $\theta_{2k}(t)$ cannot be uniquely determined without additional constraints. This segregation problem is therefore an ill-posed inverse problem.
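Eqs. (2) and (3) can be checked numerically for one time instant: if the two components' amplitudes and phases are known, the observable mixture quantities $S_k$ and $\phi_k$ reproduce them exactly (the point of the ill-posedness is that, conversely, the component phases are not observable). The amplitude and phase values below are illustrative.

```python
import numpy as np

# Assumed component amplitudes and phases at one time instant (illustrative)
A, th1 = 1.2, 0.3   # A_k(t), theta_1k(t)
B, th2 = 0.7, 1.4   # B_k(t), theta_2k(t)

# Only the mixture is observable: S exp(j*phi) = A exp(j*th1) + B exp(j*th2)
z = A * np.exp(1j * th1) + B * np.exp(1j * th2)
S, phi = np.abs(z), np.angle(z)

th = th2 - th1                                # theta_k(t), must not equal n*pi
A_rec = S * np.sin(th2 - phi) / np.sin(th)    # Eq. (2)
B_rec = S * np.sin(phi - th1) / np.sin(th)    # Eq. (3)
Z = S**2 - A_rec**2 - B_rec**2                # Z_k(t) = 2*A*B*cos(theta_k)
```

Here A_rec and B_rec recover the assumed A and B, confirming that Eqs. (2) and (3) are consistent with the mixture model.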


 
Table: Constraints corresponding to Bregman's psychoacoustical heuristic regularities. Regularities from Bregman (1993); constraints from Unoki and Akagi (1999).

(i) Common onset/offset: synchrony of onset/offset, $\vert T_{\rm S}-T_{k,{\rm on}}\vert \leq \Delta T_{\rm S}$, $\vert T_{\rm E}-T_{k,{\rm off}}\vert \leq \Delta T_{\rm E}$.
(ii) Gradualness of change (smoothness): piecewise-differentiable polynomial approximation (spline interpolation), $dA_k(t)/dt=C_{k,R}(t)$, $d\theta_{1k}(t)/dt=D_{k,R}(t)$, $dF_0(t)/dt=E_{0,R}(t)$; $\sigma_A=\int_{t_a}^{t_b}[A_k^{(R+1)}(t)]^2dt \Rightarrow \min$, $\sigma_\theta=\int_{t_a}^{t_b}[\theta_{1k}^{(R+1)}(t)]^2dt \Rightarrow \min$ (new).
(iii) Harmonicity: multiples of the fundamental frequency, $n\times F_0(t)$, $n=1,2,\cdots,N_{F_0}$.
(iv) Changes occurring in the acoustic event: correlation between the instantaneous amplitudes, $A_k(t)/\Vert A_k(t)\Vert \approx A_\ell(t)/\Vert A_\ell(t)\Vert$, $k\not=\ell$.
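As a small illustration of how the harmonicity constraint (iii) can act as a channel selector, one can keep only those filterbank channels whose center frequencies lie near a multiple of the estimated $F_0$. The channel centers, $F_0$ value, and tolerance below are hypothetical, not taken from the model.

```python
import numpy as np

# Hypothetical channel center frequencies (Hz) and an estimated F0
centers = np.array([100.0, 125.0, 250.0, 300.0, 375.0, 500.0, 610.0, 750.0])
F0, N_F0 = 125.0, 6
tol = 10.0  # matching tolerance in Hz (assumed)

harmonics = F0 * np.arange(1, N_F0 + 1)  # n * F0, n = 1..N_F0 (regularity iii)
# keep a channel only if its center lies within tol of some harmonic
near = np.abs(centers[:, None] - harmonics[None, :]) <= tol
selected = centers[np.any(near, axis=1)]
```

With these values, the channels at 125, 250, 375, 500, and 750 Hz pass the constraint, while 100, 300, and 610 Hz are rejected.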

To solve this problem, we previously proposed a basic method using constraints related to the four regularities [5], and later an improved method [6]. However, the former cannot deal with variation of the fundamental frequency, although it can segregate a synthesized signal from a noise-added signal. For the latter, it is difficult to determine the phases completely, although it can segregate a vowel from a noisy vowel precisely in terms of amplitude by constraining the continuity of the instantaneous amplitudes and the fundamental frequency.

This paper proposes a new sound segregation method that deals with real speech and noise precisely, even at the waveform level, by using constraints on the continuity of the instantaneous phases in addition to constraints on the continuity of the instantaneous amplitudes and the fundamental frequency.


Masashi Unoki
2000-10-26