Akagi, M. and Tohkura, Y. (1988). "On the application of spectrum target prediction model to speech recognition", Proc. Int. Conf. Acoustics Speech and Signal Process., New York, 139-142.

ABSTRACT

This paper proposes a preprocessing method for automatic speech recognition that uses a Spectrum Target Prediction Model to cope with co-articulation, one of the most serious problems in automatic speech recognition. The method is evaluated by three measures: spectral stability, which measures the predicted spectrum variation within each phoneme portion; intra-category variation; and inter-category variation. Experimental results indicate that the predicted spectra produced by the model are stabilized within each phoneme portion, eliminating the variations present in the original, unpredicted spectra. The results also indicate that the preprocessing method decreases intra-category variation and increases inter-category variation. Consequently, the Spectrum Target Prediction Model, implemented as a speech recognition preprocessor, improves automatic speech recognition performance.
 
 
 
 

Akagi, M. (1990). "Evaluation of a spectrum target prediction model in speech perception", J. Acoust. Soc. Am., 87, 2, 858-865.

ABSTRACT

A model of a spectrum target prediction mechanism is proposed and evaluated by comparing predicted values with results of psychoacoustic experiments. When the trajectory of the cepstrally smoothed LPC spectrum is approximated by a 2nd-order critically damped system, the proposed model can estimate target values using short-period spectrum sequences (50 ms) without being given the onset positions of the spectral transition. Additionally, this model decreases the length of transitional sounds and recovers vowel characteristics neutralized by co-articulation. Moreover, this model compensates for the transitions of syllables and extracts stable characteristics from syllable transitions. This model is applicable to co-articulation recovery in speech signal processing.
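
As a rough illustration of the idea, the sketch below (Python, not taken from the paper) fits a 2nd-order critically damped step response to a short trajectory of one spectral coefficient and reads off its asymptotic target; the function name, the grid-searched damping parameter, and the least-squares fit are assumptions made here for simplicity.

import numpy as np

def predict_target(x, dt, alphas=np.linspace(10.0, 200.0, 96)):
    # Assume the trajectory follows a critically damped 2nd-order step response:
    #   x(t) = T + (x[0] - T) * (1 + a*t) * exp(-a*t)
    # Grid-search the damping 'a'; for each candidate, T has a closed-form
    # least-squares solution because x(t) is linear in T.
    t = np.arange(len(x)) * dt
    best_err, best_T = np.inf, x[-1]
    for a in alphas:
        g = (1.0 + a * t) * np.exp(-a * t)          # decay factor toward the target
        w = 1.0 - g
        if np.dot(w, w) < 1e-12:
            continue
        T = np.dot(w, x - x[0] * g) / np.dot(w, w)  # least-squares target estimate
        err = np.sum((T * w + x[0] * g - x) ** 2)
        if err < best_err:
            best_err, best_T = err, T
    return best_T

# Toy example: a 50 ms trajectory (5 ms frame shift) still moving toward 1.0
dt, a_true = 0.005, 60.0
t = np.arange(10) * dt
x = 1.0 + (0.2 - 1.0) * (1 + a_true * t) * np.exp(-a_true * t)
print(predict_target(x, dt))   # close to 1.0, although x never reaches it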

 

 

Akagi, M. and Tohkura, Y. (1990). "Spectrum target prediction model and its application to speech recognition", Computer Speech and Language, 4, Academic Press, 325-344.

ABSTRACT

This paper proposes a model of a spectrum target prediction mechanism and a preprocessing method for automatic speech recognition that uses the model to cope with co-articulation. The model is constructed to predict a particular spectrum for each phoneme, that is, the phoneme target, and to keep these spectra constant within each phoneme interval. The method is evaluated by four measures: spectrum sequence stability (are the predicted spectrum sequences in each phoneme interval stable?), intra-category spectrum variation (is the variation of predicted spectra within each phoneme category small?), inter-category spectrum variation (are phoneme category pairs far apart as measured by the Mahalanobis distance?), and the length of transitional sounds (how long are the incorrectly recognized portions within a phoneme interval?). Experimental results indicate that the predicted spectra produced by the model are stabilized within each phoneme interval. Moreover, the method decreases intra-category variation and increases inter-category variation. The results also indicate that the model recovers vowel characteristics neutralized by co-articulation at spectral transition portions and decreases the duration of transitional sounds. Consequently, the spectrum target prediction model implemented as a speech recognition preprocessor reduces recognition error rates.
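
For reference, the inter-category measure mentioned above is the Mahalanobis distance between phoneme categories; a minimal Python sketch is given below. Pooling the two category covariances and the function name are choices made here for illustration, not necessarily the paper's exact definition.

import numpy as np

def mahalanobis_between(X_a, X_b):
    # X_a, X_b: (frames x coefficients) spectra belonging to two phoneme categories.
    mu_a, mu_b = X_a.mean(axis=0), X_b.mean(axis=0)
    cov = 0.5 * (np.cov(X_a, rowvar=False) + np.cov(X_b, rowvar=False))  # pooled covariance
    d = mu_a - mu_b
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))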

 

 

Akagi, M. (1993). "Modeling of contextual effects based on spectral peak interaction", J. Acoust. Soc. Am., 93, 2, 1076-1086.

ABSTRACT

This paper presents a model of contextual effects able to cope with coarticulation problems, especially vowel neutralization. The model is designed to capture the mechanisms underlying humans' superior recognition ability and to apply these mechanisms to automatic speech recognition and synthesis. It predicts target spectral peaks in reduced vowels based on interactions between spectral peak pairs. To construct and substantiate the model, psychoacoustic experiments were carried out to measure the extent of phoneme boundary shift with a single-formant stimulus as a preceding anchor. The results of the experiments were compared with the spectral peak interactions obtained from real speech data using the model. This comparison showed that the spectral peak interactions measured through perceptual boundary shifts with a single-formant anchor are similar to the spectral peak interactions estimated by the model. Additionally, recovery simulations of reduced spectral peak trajectories with real speech data showed that the spectral peak interactions obtained from the psychoacoustic experiments can be used to predict target spectral peaks from reduced spectral peak trajectories in the same manner as the spectral peak interaction function estimated by the model. These results suggest that the model may be emulating aspects of the human mechanisms, that the contextual effects resulting from the interactions between single-formant stimuli can play an important role in improving phoneme neutralization recovery, and that the neutralization recovery model can be formulated as the sum of the interactions between spectral peaks. Furthermore, the model can be implemented as a speech recognition preprocessor to reduce recognition error rates because it can overshoot spectral peak trajectories, shift spectral peaks toward their targets, and increase both the distances among category centers and the Bhattacharyya distances between vowel categories.
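
The Bhattacharyya distance cited at the end is the standard one between two Gaussian category models; a brief sketch follows, under the assumption (made here, not stated in the abstract) that each vowel category is modeled as a full-covariance Gaussian.

import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    # Bhattacharyya distance between two Gaussian vowel categories.
    cov = 0.5 * (cov1 + cov2)
    d = mu1 - mu2
    term1 = 0.125 * d @ np.linalg.solve(cov, d)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return float(term1 + term2)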

 

 

Akagi, M., van Wieringen, A. and Pols, L. C. W. (1994). "Perception of central vowel with pre- and post-anchors", Proc. Int. Conf. Spoken Lang. Process. 94, 503-506.

ABSTRACT

A vowel identification experiment and a vowel matching experiment were performed to examine how preceding and following anchor signals affect central vowel perception. Previous experiments had shown that dynamic aspects of stimuli and the relation between the central vowel and the adjacent anchors may induce overshoot or extrapolation during stimulus processing. The present experiments examine the relative importance of the surrounding steady-states and the formant transitions with regard to central vowel extrapolation. Assuming that continuous phoneme sequences such as consonant-vowel-consonant (CVC) or VVV can be constructed from steady-states, transitions, and vowels, four stimulus conditions are used: vowel presented in isolation (called Ref.), vowel with transitions (called _), isolated vowel surrounded by steady-states (called _), and vowel with transitions and steady-states (called _). The central vowels were always 5-formant synthesized vowels, whereas the surrounding steady-states and the transitions were either single-formant or 5-formant sounds. The experimental results suggest that: (1) central vowel extrapolation occurs with _-type stimuli in both single- and 5-formant conditions, whereas averaging effects are observed with _- and _-type stimuli for some of the subjects. The overall order of the amount of overshoot is _ > _ > _ > Ref. in the 5-formant condition, and the most natural-sounding W-type stimuli showed the largest amount of overshoot; and (2) the amount of overshoot with a 5-formant steady-state is larger than with a single-formant steady-state, especially for the P-type stimuli. This might be an indication that the 'vowelness' of the pre- and post-anchors also contributes to the amount of overshoot. The matching results were less consistent.

 

 

Yonezawa, Y. and Akagi, M. (1996). "Modeling of contextual effects and its application to word spotting", Proc. Int. Conf. Spoken Lang. Process. 96, 2063-2066.

ABSTRACT

We propose a model of spectral contextual effects to simulate the superior recognition ability of humans and apply it as a front-end processor for word spotting. This model assumes that perceived spectra are influenced by adjacent spectral peaks and that the magnitude of this influence can be estimated by the minimum classification error criterion. Three experiments were carried out to evaluate the performance of the model. The results show that the model can compensate for neutralized spectra and restore them toward their typical patterns, which improves word spotting accuracy.

 

 

Kitamura, T. and Akagi, M. (1994). "Speaker individualities in speech spectral envelopes", Proc. Int. Conf. Spoken Lang. Process. 94, 1183-1186.

ABSTRACT

Physical characteristics representing speaker individualities embedded in the spectral envelopes of vowels are investigated through four psychoacoustic experiments. The LMA analysis-synthesis system is used to prepare stimuli in which specific frequency bands of the spectral envelopes are varied, and the frequency bands carrying speaker individualities are estimated. The experimental results suggest that speaker individualities exist mainly above 23.5 ERB rate (2340 Hz) in the spectral envelopes and can be controlled without influencing vowel identification. Additionally, more detailed spectral envelope information is required for speaker identification than for vowel identification.

 

 

Kitamura, T. and Akagi, M. (1995). "Speaker individualities in speech spectral envelopes", J. Acoust. Soc. Jpn. (E), 16, 5, 283-289.

ABSTRACT

The aim of the three psychoacoustic experiments described here was to clarify whether there are speaker individualities in spectral envelopes, in which frequency bands such individualities exist, and how frequency bands carrying speaker individualities can be manipulated. The LMA analysis-synthesis system was used to prepare stimuli in which specific frequency bands were varied, and the frequency bands carrying speaker individualities were estimated experimentally. The results indicate that (1) speaker individualities exist in spectral envelopes, (2) these individualities lie mainly at frequencies higher than 22 ERB rate (2212 Hz), while vowel characteristics lie between 12 ERB rate (603 Hz) and 22 ERB rate, and (3) voice quality can be controlled by replacing the higher frequency band of one talker with that of another. The replacement point is the spectral local minimum immediately below the spectral local maximum around 23 ERB rate in the spectral envelope.
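
The ERB-rate values quoted here are consistent with the Glasberg and Moore (1990) ERB-rate scale; the small conversion sketch below assumes that formula is the one intended (the paper does not state it explicitly).

import numpy as np

def hz_to_erb_rate(f_hz):
    # Glasberg & Moore (1990): ERB-rate = 21.4 * log10(4.37 * f/1000 + 1)
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_rate_to_hz(erb):
    # Inverse mapping: ERB-rate -> Hz
    return (10.0 ** (erb / 21.4) - 1.0) * 1000.0 / 4.37

print(round(erb_rate_to_hz(22.0)))   # ~2213 Hz, cf. the 2212 Hz quoted above
print(round(erb_rate_to_hz(12.0)))   # ~604 Hz,  cf. the 603 Hz quoted above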

 

 

Kitamura, T. and Akagi, M. (1996). "Relationship between physical characteristics and speaker individualities in speech spectral envelopes", Proc ASA-ASJ Joint Meeting, 833-838.

ABSTRACT

Significant physical characteristics for speaker identification in the speech spectral envelopes of vowels were investigated by psychoacoustic experiments. Our previous studies [T. Kitamura and M. Akagi, J. Acoust. Soc. Jpn. (E), 16, 283-289 (1995)] showed that the speaker individualities in the spectral envelopes of vowels exist mainly in the higher frequency bands. In this study, the effect on speaker identification of eliminating the spectral peaks and/or dips of the spectral envelopes in the higher frequency band was investigated. Additionally, the frequency band carrying speaker individualities was specified in more detail. The stimuli for the experiments were vowels re-synthesized from their FFT cepstral data using the Log Magnitude Approximation (LMA) analysis-synthesis system. Their pitch frequencies and power were normalized, and specific frequency bands of the spectral envelopes were manipulated. The experimental results lead to the following conclusions: 1) the peaks in the spectral envelopes are more significant than the dips for speaker identification; 2) speaker individualities exist mainly in the frequency band above the peak around 20 ERB rate (1740 Hz), and voice quality can be controlled by replacing this frequency band of one speaker with that of another.

 

 

Akagi, M. and Ienaga, T. (1995). "Speaker individualities in fundamental frequency contours and its control", Proc. EUROSPEECH95, 439-442.

ABSTRACT

Speaker individualities in F0 contours are investigated through analyses of several speakers' utterances and through psychoacoustic experiments. The stimuli for the experiments are re-synthesized with manipulated F0 contours and with spectral envelopes averaged over all speakers, using the Log Magnitude Approximation analysis-synthesis system. The analysis and experimental results indicate that (1) there are speaker individualities in the F0 contours, (2) some specific parameters related to the dynamics of F0 contours carry many speaker individuality features, and speaker individuality can be controlled by manipulating these parameters, and (3) although there are speaker individuality features in the time-averaged F0 contours, they contribute less to speaker identification than the dynamics of the F0 contours.

 

 

Akagi, M. and Ienaga, T. (1997). "Speaker individuality in fundamental frequency contours and its control", J. Acoust. Soc. Jpn. (E), 18, 2, 73-80.

ABSTRACT

Speaker individualities in fundamental frequency (F0) contours are investigated through analyses of several speakers' utterances and through psychoacoustic experiments. The analyses extract significant physical characteristics of F0 using Fujisaki and Hirose's analysis method and the F-ratio of each physical characteristic. The experiments clarify the relationship between these physical characteristics and the perception of the speakers' speech. The stimuli used in the experiments are re-synthesized with manipulated F0 contours and with spectral envelopes averaged over all speakers, using the Log Magnitude Approximation analysis-synthesis system. The analysis and experimental results indicate that (1) there is speaker individuality in the F0 contours, (2) some specific parameters related to the dynamics of F0 contours carry many speaker individuality features, and speaker individuality can be controlled by manipulating these parameters, and (3) although there are speaker individuality features in the time-averaged F0, they contribute less to speaker identification than the dynamics of the F0 contours.
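
The Fujisaki-Hirose analysis mentioned above decomposes ln F0(t) into a baseline, phrase components, and accent components. Below is a minimal generation-side sketch of that standard model; the function name, command timings, and the alpha/beta/gamma values are illustrative assumptions, not values from the paper.

import numpy as np

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=3.0, beta=20.0, gamma=0.9):
    # phrase_cmds: list of (T0, Ap); accent_cmds: list of (T1, T2, Aa)
    def Gp(x):   # phrase control: impulse response of a critically damped 2nd-order system
        return np.where(x >= 0, alpha ** 2 * x * np.exp(-alpha * x), 0.0)
    def Ga(x):   # accent control: step response, clipped at gamma
        return np.where(x >= 0, np.minimum(1 - (1 + beta * x) * np.exp(-beta * x), gamma), 0.0)
    lnf0 = np.full_like(t, np.log(fb))
    for T0, Ap in phrase_cmds:
        lnf0 = lnf0 + Ap * Gp(t - T0)
    for T1, T2, Aa in accent_cmds:
        lnf0 = lnf0 + Aa * (Ga(t - T1) - Ga(t - T2))
    return np.exp(lnf0)

t = np.linspace(0.0, 2.0, 400)
f0 = fujisaki_f0(t, fb=110.0, phrase_cmds=[(0.0, 0.5)], accent_cmds=[(0.3, 0.8, 0.4)])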

 

 

Akagi, M., Iwaki, M. and Minakawa, T. (1998). “Fundamental frequency fluctuation in continuous vowel utterance and its perception,” ICSLP98, Sydney.

ABSTRACT

This paper reports how rapid fluctuations of the fundamental frequency in continuously uttered vowels influence vowel quality and shows that vowel qualities with various fundamental frequency fluctuations can be discriminated perceptually. For this purpose, electroglottographs (EGGs) of vowels uttered by nine males were obtained using a Laryngograph, and fundamental frequencies with rapid fluctuations were estimated from them. Analysis of the forty-five estimated fundamental frequency contours showed that they can be classified into four groups. Moreover, psychoacoustic experiments with five subjects, evaluating voice quality by multidimensional scaling (MDS), showed that the voice qualities of speech synthesized using the fundamental frequencies of these groups were completely discriminable, and that there is a distinctive frequency band of fundamental frequency fluctuation that perceptually specifies each group.

 

 

Akagi, M., Kitamura, T., Suzuki, N. and Michi, K. (1996). "Perception of lateral misarticulation and its physical correlates", Proc ASA-ASJ Joint Meeting, 933-936.

ABSTRACT

To discuss the relationship between perceptual diagnoses of lateral misarticulation (LM) by sophisticated listeners and their physical correlates, two experiments using continuously uttered /sh/ are performed. Experiment 1 compares the spectral envelopes of normal speech /sh/ (NS) with those of LM. Experiment 2 assesses similarities between LM and NS with specific spectral envelope bands replaced, based on the auditory impressions of sophisticated listeners. The stimuli for experiment 2 were re-synthesized from modified spectral envelopes using the LMA synthesizer. These experiments show that the spectral envelopes of LM are flat in the frequency band above approximately 4 kHz, whereas NS presents a plateau. Moreover, there is a substantial peak at around 3.2 kHz in LM, which varies with time almost periodically; this variation is not present in NS. The experiments also show that replacing the spectral envelope of NS with that of LM between 2.5 and 4.5 kHz results in a remarkable increase in similarity to LM based on auditory impressions. These findings suggest that the spectral envelope characteristic of LM is a near-periodic variation around 3.2 kHz.

 

 

Akagi, M. and Mizumachi, M. (1997). "Noise Reduction by Paired Microphones", Proc. EUROSPEECH97, 335-338.

ABSTRACT

This paper proposes a front-end method for enhancing the target signal by subtracting estimated noise from a noisy signal using paired microphones, assuming that the noise is unevenly distributed with regard to time, frequency, and direction. Although the Griffiths-Jim adaptive beamformer has been proposed based on the same concept, it has some drawbacks: sudden noises cannot be reduced because the convergence of the adaptive filter is slow, and the signal is distorted in a reverberant environment. The proposed method overcomes these drawbacks by formulating the noise in terms of arrival time differences between the paired microphones and by estimating it analytically from the noise directions. The results show that the method with one microphone pair can increase signal-to-noise ratios (SNR) by 10-20 dB in simulations and can reduce log-spectrum distances by about 5 dB in real noisy environments.

 

 

Mizumachi, M. and Akagi, M. (1998). “Noise reduction by paired-microphones using spectral subtraction,” Proc. ICASSP98, II, 1001-1004.

ABSTRACT

This paper proposes a method of noise reduction with paired microphones as a front-end processor for speech recognition systems. The method estimates the noise using a subtractive microphone array and subtracts it from the noisy speech signal using spectral subtraction (SS). Since the method estimates the noise analytically, frame by frame, it can do so regardless of the noise's acoustic properties. It can therefore also reduce non-stationary noises, for example the sudden noise of a closing door, which cannot be reduced by other SS methods. The results of computer simulations and experiments in a real environment show that the method can reduce LPC log-spectral envelope distortions.
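
As context, the spectral-subtraction step itself is standard; the per-frame sketch below shows generic SS with an externally supplied noise magnitude spectrum standing in for the paper's array-based estimate, and its function name and parameter defaults are assumptions of this sketch.

import numpy as np

def spectral_subtraction(noisy_frame, noise_mag, over=1.0, floor=0.01):
    # noisy_frame: windowed time-domain frame
    # noise_mag:   magnitude spectrum of the noise estimated for this frame
    #              (here assumed to come from the subtractive microphone array)
    spec = np.fft.rfft(noisy_frame)
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - over * noise_mag, floor * mag)   # spectral floor
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy_frame))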

 

 

Unoki, M. and Akagi, M. (1997). "A method of signal extraction from noisy signal based on auditory scene analysis", Proc. CASA97, IJCAI-97, Nagoya, 93-102.

ABSTRACT

This paper presents a method of extracting a desired signal from a noise-added signal as a model of acoustic source segregation. Using physical constraints related to the four regularities proposed by Bregman, the proposed method can solve the problem of segregating two acoustic sources. These physical constraints correspond to the regularities, which we have translated from qualitative conditions into quantitative ones. Three simulations were carried out using the following signals: (a) a noise-added AM complex tone, (b) mixed AM complex tones, and (c) a noisy synthetic vowel. The performance of the proposed method was evaluated using two measures: precision, which is similar to SNR, and spectral distortion (SD). The results for signals (a) and (b) show that the proposed method can extract the desired AM complex tone from a noise-added AM complex tone or from mixed AM complex tones in which the signal and the noise occupy the same frequency region; in particular, the SD is reduced by about 20 dB on average. Moreover, the result for signal (c) shows that the proposed method can also extract the speech signal from noisy speech.

 

 

Unoki, M. and Akagi, M. (1997). "A method of signal extraction from noisy signal", Proc. EUROSPEECH97, 2587-2590.

ABSTRACT

This paper presents a method of extracting a desired signal from a noise-added signal as a model of acoustic source segregation. Using physical constraints related to the four regularities proposed by Bregman, the proposed method can solve the problem of segregating two acoustic sources. Two simulations were carried out using the following signals: (a) a noise-added AM complex tone and (b) a noisy synthetic vowel. It was shown that the proposed method can extract the desired AM complex tone from a noise-added AM complex tone in which the signal and the noise occupy the same frequency region. The SD was reduced by an average of about 20 dB. It was also shown that the proposed method can extract a speech signal from noisy speech.

 

 

Unoki, M. and Akagi, M. (1998). “Signal extraction from noisy signal based on auditory scene analysis,” ICSLP98, Sydney, Vol.5, 2115-2118.

ABSTRACT

This paper proposes a method of extracting a desired signal from a noisy signal. The method solves the problem of segregating two acoustic sources by using constraints related to the four regularities proposed by Bregman and by making two improvements to our previously proposed method. One is to incorporate a method of estimating the fundamental frequency using comb filtering on the filterbank outputs. The other is to reconsider the constraints in the separation block, which constrain the instantaneous amplitude, input phase, and fundamental frequency of the desired signal. Simulations performed to segregate a vowel from a noisy vowel, and to compare the results of using all or only some of the constraints, showed that the improved method can segregate real speech precisely when all the constraints related to the four regularities are used, and that the absence of some constraints reduces the accuracy.
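
The first improvement is F0 estimation by comb filtering; the sketch below is a much-simplified stand-in (a spectral harmonic comb applied to a single FFT frame rather than the paper's filterbank-based version), with its function name and search range chosen here for illustration.

import numpy as np

def comb_f0(x, fs, f0_min=80.0, f0_max=400.0):
    # Pick the F0 candidate whose harmonic comb collects the most spectral energy.
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    best_f0, best_score = f0_min, -np.inf
    for f0 in np.arange(f0_min, f0_max, 1.0):
        idx = np.searchsorted(freqs, np.arange(f0, freqs[-1], f0))
        idx = idx[idx < len(spec)]
        score = spec[idx].mean()
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0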

 

 

Akagi, M., Iwaki, M. and Sakaguchi, N. (1998). “Spectral sequence compensation based on continuity of spectral sequence,” ICSLP98, Sydney, Vol.4, 1407-1410.

ABSTRACT

Humans have an excellent ability to select a particular sound source in a noisy environment, called the "cocktail-party effect," and to compensate for physically missing sound, called the "illusion of continuity." This paper proposes a spectral peak tracker as a model of the illusion of continuity (or phonemic restoration) and a spectral sequence prediction method using that tracker. Although some models have already been proposed, they treat only spectral peak frequencies and often generate incorrect predicted spectra. We introduce a peak representation of the log-spectrum with four parameters: amplitude, frequency, bandwidth, and asymmetry, using a spectral shape analysis method based on the wavelet transform. We then devise a time-varying second-order system to formulate the trajectories of the parameters. We demonstrate that the model can estimate and track the parameters for connected vowels whose transition section has been partially replaced by white noise.

 

 

Unoki, M. and Akagi, M. (1999). "A method of signal extraction from noisy signal based on auditory scene analysis," Speech Communication, 27, 3-4, 261-279.

ABSTRACT

This paper proposes a method of extracting a desired signal from a noisy signal, addressing the problem of segregating two acoustic sources as a model of acoustic source segregation based on auditory scene analysis. Since the problem of segregating two acoustic sources is an ill-posed inverse problem, constraints are needed to determine a unique solution. The proposed method uses the four heuristic regularities proposed by Bregman as constraints, and uses the instantaneous amplitudes and phases of noisy signal components that have passed through a wavelet filterbank as features of the acoustic sources. The model then extracts the instantaneous amplitude and phase of the desired signal. Simulations were performed to segregate a harmonic complex tone from a noise-added harmonic complex tone and to compare the results of using all or only some of the constraints. The results show that the method can segregate the harmonic complex tone precisely when all the constraints related to the four regularities are used, and that the absence of some constraints reduces the accuracy.

 

 

Mizumachi, M. and Akagi, M. (1999). "Noise reduction method that is equipped for robust direction finder in adverse environments," Proc. Workshop on Robust Methods for Speech Recognition in Adverse Conditions, Tampere, Finland, 179-182.

ABSTRACT

In this paper, a direction finder that is robust in noisy environments is proposed. The authors have incorporated it into their previously proposed noise reduction algorithm, which estimates the noise spectrum using the speech and noise directions and subtracts it using spectral subtraction. Experiments confirm that the direction finder works very accurately and that this 3-channel subtractive array, whose performance corresponds to that of a 6-channel delay-and-sum array, is effective as a front-end for speech recognition systems.

 

 

Unoki, M. and Akagi, M. (1999). "Segregation of vowel in background noise using the model of segregating two acoustic sources based on auditory scene analysis", Proc. CASA99, IJCAI-99, Stockholm, 51-60.

ABSTRACT

This paper proposes an auditory sound segregation model based on auditory scene analysis. It solves the problem of segregating two acoustic sources by using constraints related to the heuristic regularities proposed by Bregman and by making an improvement to our previously proposed model. The improvement is to reconsider constraints on the continuity of instantaneous phases as well as constraints on the continuity of instantaneous amplitudes and fundamental frequencies in order to segregate the desired signal from a noisy signal precisely even in waveforms. Simulations performed to segregate a real vowel from a noisy vowel and to compare the results of using all or only some constraints showed that our improved model can segregate real speech precisely even in waveforms using all the constraints related to the four regularities, and that the absence of some constraints reduces the segregation accuracy.

 

 

Unoki, M. and Akagi, M. (1999). "Segregation of vowel in background noise using the model of segregating two acoustic sources based on auditory scene analysis", Proc. EUROSPEECH99, 2575-2578.

ABSTRACT

This paper proposes an auditory sound segregation model based on auditory scene analysis. It solves the problem of segregating two acoustic sources by using constraints related to the heuristic regularities proposed by Bregman and by making an improvement to our previously proposed model. The improvement is to reconsider constraints on the continuity of instantaneous phases as well as constraints on the continuity of instantaneous amplitudes and fundamental frequencies in order to segregate the desired signal from a noisy signal precisely even in waveforms. Simulations performed to segregate a real vowel from a noisy vowel and to compare the results of using all or only some constraints showed that our improved model can segregate real speech precisely even in waveforms using all the constraints related to the four regularities, and that the absence of some constraints reduces the segregation accuracy.

 

 

Mizumachi, M. and Akagi, M. (1999). "An objective distortion estimator for hearing aids and its application to noise reduction," Proc. EUROSPEECH99, 2619-2622.

ABSTRACT

In this paper, an objective distortion estimator called the auditory-oriented spectral distortion (ASD) is proposed. It is confirmed that the ASD can accurately predict the auditory perceptual distortion represented by the mean opinion score (MOS). The ASD is used as a criterion to optimize the noise reduction algorithm for devices, such as hearing aids, that need to reduce noise in a manner acceptable to the ear. Experimental results show that the suitable parameter value differs depending on whether the purpose is to improve auditory impressions or to serve as a front-end for speech recognition systems.

 

 

Mizumachi, M. and Akagi, M. (1999). "The auditory-oriented spectral distortion for evaluating speech signals distorted by additive noises," J. Acoust. Soc. Jpn. (E), (in press).

ABSTRACT

This paper proposes an objective speech distortion measure as a substitute for the human auditory system. Simultaneous and temporal masking effects are introduced into this measure, called the auditory-oriented spectral distortion (ASD). The ASD is calculated from the spectral components above the masked thresholds, in the same way as the spectral distortion (SD). We confirmed that the ASD is more consistent than the SD with the subjective mean opinion score, which represents distortion in auditory impression. We applied the ASD to optimize a noise reduction algorithm proposed by the authors, and confirmed that the optimized algorithm reduces noise in a perceptually acceptable way. The ASD promises to be a useful guide for designing noise reduction algorithms.
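
For orientation, the plain SD against which the ASD is compared is the frame-wise log-spectral distance; the sketch below also illustrates the ASD's basic idea of dropping components below a masked threshold. The masking model itself is omitted and simply passed in as a curve, which is an assumption of this sketch, not the paper's formulation.

import numpy as np

def spectral_distortion(spec_ref, spec_test, mask_threshold_db=None):
    # Frame-wise log-spectral distortion in dB between two magnitude spectra.
    ref_db = 20.0 * np.log10(np.maximum(np.abs(spec_ref), 1e-12))
    test_db = 20.0 * np.log10(np.maximum(np.abs(spec_test), 1e-12))
    if mask_threshold_db is not None:            # ASD-like: keep only components above masking
        keep = ref_db > mask_threshold_db
        ref_db, test_db = ref_db[keep], test_db[keep]
    return float(np.sqrt(np.mean((ref_db - test_db) ** 2)))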

 

 

Maki, K. and Akagi, M. (1997). "A functional model of the auditory peripheral system", Proc. ASVA97, Tokyo, 703-710.

ABSTRACT

This paper presents a functional model of the auditory peripheral system for obtaining the impulse trains that are input to the central auditory system. To model the external ear, the middle ear, the basilar membrane (BM), and the outer hair cells (OHC), a dual analog model of the ascending path reported by Giguère et al. (1994) is adopted. In this paper, we develop an inner hair cell (IHC) model by extending Meddis's model (Meddis et al., 1986, 1988). This model can simulate nonlinear transducer functions of the IHC, namely the depolarized and hyperpolarized peak responses as a function of peak sound pressure level and the DC component of the receptor potential as a function of stimulus level. An auditory nerve (AN) model based on Hodgkin's cell membrane model (1952) is proposed to generate nerve impulse trains. These models are combined into a functional model of the auditory peripheral system. Outputs of the functional model are compared with physiological experimental data. The results show that the proposed model is in excellent agreement with the physiological data and that the model is effective in providing primary inputs to central auditory processing models. Additionally, using vowels as input to the model, we can obtain discharge patterns at all CFs from its output. These patterns show how vowel features are represented in the auditory peripheral system.

 

 

Maki, K., Hirota, K. and Akagi, M. (1998). “A functional model of the auditory peripheral system: Responses to simple and complex stimuli,” Computational Hearing, Italy, 13-18.

ABSTRACT

An auditory nerve (AN) model is proposed that functionally models the membrane potential change and firing mechanism of the AN, generating trains of nerve spikes and allowing the discharge patterns of the model to be easily controlled in both intensity and timing through the AN model parameters. To model the external ear, the middle ear, the basilar membrane, the outer hair cells, and the inner hair cells, a dual analog model of the ascending path reported by Giguère and Woodland (1994) is used. These models are combined into a functional model of the auditory peripheral system. To evaluate the model, its response patterns to both simple and complex stimuli, such as vowels, are compared in detail with physiological experimental data. The evaluation shows that the proposed model can simulate various responses of AN fibers, including rapid and short-term adaptation, phase locking versus intensity and frequency of stimulation, recovery from adaptation, response changes with intensity, the hazard function for driven activity, and the representation of vowels in terms of timing and intensity. Accordingly, the proposed model can provide primary inputs to central auditory processing models.

 

 

Itoh, K. and Akagi, M. (1998). “A computational model of auditory sound localization,” Computational Hearing, Italy, 67-72.

ABSTRACT

This paper presents a computational model of auditory sound localization based on the interaural time difference (ITD). Nerve impulses and synaptic transmission in the nervous system are modeled computationally and applied to a coincidence detector circuit model to detect ITDs. To determine ITDs more accurately and effectively, a multi-threshold model and an inhibition model are adopted to emphasize ITDs. Simulation results show that implementing temporal redundancy in nerve impulses and synaptic transmission is useful for improving the accuracy of coincidence detection of impulses that fluctuate in time.
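
A purely functional stand-in for the coincidence-detection stage is interaural cross-correlation; the sketch below estimates the ITD that way and is only a simplification of the spike-level model described above, with its function name and lag limit chosen here for illustration.

import numpy as np

def estimate_itd(left, right, fs, max_itd=1e-3):
    # Cross-correlate the two ear signals and pick the lag with maximum correlation,
    # restricted to physiologically plausible ITDs (|ITD| <= max_itd seconds).
    n = len(left)
    corr = np.correlate(left, right, mode="full")   # lags -(n-1) .. n-1
    lags = np.arange(-(n - 1), n)
    keep = np.abs(lags) <= int(max_itd * fs)
    best = lags[keep][np.argmax(corr[keep])]
    return best / fs        # positive: `left` is delayed relative to `right`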

 

 

Unoki, M. and Akagi, M. (1998). “A computational model of co-modulation masking release,” Computational Hearing, Italy, 129-134.

ABSTRACT

This paper proposes a computational model of co-modulation masking release (CMR). It consists of two models, our auditory segregation model (model A) and the power spectrum model of masking (model B), and a selection process that chooses between their results. Model A extracts a sinusoidal signal using the outputs of multiple auditory filters, and model B extracts a sinusoidal signal using the output of a single auditory filter. The selection process selects, from the two extracted signals, the sinusoidal signal with the lower signal threshold. For both models, simulations similar to Hall et al.'s demonstrations were carried out. The simulation stimuli consisted of two types of noise masker: bandpassed random noise and AM bandpassed random noise. The signal threshold of the pure tone extracted using the proposed model shows properties similar to those in Hall et al.'s demonstrations. The maximum amount of CMR in the proposed model is about 8 dB.

 

 

Nandasena, A.C.R. and Akagi, M. (1998). “Spectral stability based event localizing temporal decomposition,” Proc. ICASSP98, II, 957-960.

ABSTRACT

In this paper a new approach to temporal decomposition (TD) of speech, called "Spectral Stability Based Event Localizing Temporal Decomposition" and abbreviated S2BEL-TD, is presented. The original TD method proposed by Atal is known to have the drawbacks of high computational cost and instability in the number and locations of events. In S2BEL-TD, event localization is performed based on a maximum spectral stability criterion, which overcomes the event instability of Atal's method. S2BEL-TD also avoids the computationally costly singular value decomposition routine used in Atal's method, resulting in a computationally simpler TD algorithm. Simulation results show that an average spectral distortion of about 1.5 dB can be achieved with LSFs as the spectral parameter. We have also shown that the temporal pattern of the speech excitation parameters can be well described using the S2BEL-TD technique.
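
The event-localization idea can be illustrated very simply: events sit where the spectral parameter trajectory is locally most stable. The rough sketch below applies that criterion; the paper's exact stability measure and refinement steps are not reproduced here, and the function name and smoothing length are assumptions of this sketch.

import numpy as np

def locate_events(P, smooth=3):
    # P: (frames x coefficients) spectral parameters, e.g. LSFs per frame.
    rate = np.linalg.norm(np.diff(P, axis=0), axis=1)                  # spectral transition rate
    rate = np.convolve(rate, np.ones(smooth) / smooth, mode="same")    # light smoothing
    is_min = (rate[1:-1] < rate[:-2]) & (rate[1:-1] <= rate[2:])       # local minima = most stable
    return np.where(is_min)[0] + 1                                     # approximate event frames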