Papers by Trausti Kristjansson
ArXiv, 2020
Music source separation has been a popular topic in signal processing for decades, not only because of its technical difficulty, but also due to its importance to many commercial applications, such as automatic karaoke and remixing. In this work, we propose a novel self-attention network to separate voice and accompaniment in music. First, a convolutional neural network (CNN) with densely-connected CNN blocks is built as our base network. We then insert self-attention subnets at different levels of the base CNN to exploit the long-term intra-dependency of music, i.e., repetition. Within the self-attention subnets, repetitions of the same musical patterns inform the reconstruction of other repetitions, improving source separation performance. Results show the proposed method leads to a 19.5% relative improvement in vocal separation in terms of SDR. We compare our method with state-of-the-art systems, i.e., MMDenseNet and MMDenseLSTM.
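A minimal sketch, assuming a PyTorch-style setup with hypothetical layer sizes, of how a self-attention subnet over time frames can be inserted between CNN blocks so that repetitions far apart in time can inform each other; this is illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SelfAttentionSubnet(nn.Module):
    """Self-attention over the time frames of a (batch, channels, freq, time) feature map."""
    def __init__(self, channels, freq_bins, num_heads=4):
        super().__init__()
        embed_dim = channels * freq_bins                       # one token per time frame
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x):                                      # x: (B, C, F, T)
        b, c, f, t = x.shape
        tokens = x.permute(0, 3, 1, 2).reshape(b, t, c * f)    # (B, T, C*F)
        attended, _ = self.attn(tokens, tokens, tokens)        # long-range repetition cues
        out = attended.reshape(b, t, c, f).permute(0, 2, 3, 1)
        return x + out                                         # residual connection

# Hypothetical base network: a conv block, the attention subnet, and a mask head.
base = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
attn = SelfAttentionSubnet(channels=16, freq_bins=128)
head = nn.Conv2d(16, 1, 3, padding=1)

spec = torch.rand(2, 1, 128, 256)                              # (batch, 1, freq, time) magnitudes
mask = torch.sigmoid(head(attn(base(spec))))                   # soft vocal mask
vocals = mask * spec
```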
2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings
We consider the problem of robust speech recognition in the car environment. We present a new dynamic noise adaptation algorithm, called DNA, for the robust front-end compensation of evolving semi-stationary noise as typically encountered in the car setting. A large dataset of in-car noise was collected for the evaluation of the new algorithm. This dataset was combined with the Aurora II framework to produce a new, publicly available framework, called DNA + AURORA II, for the evaluation of adaptive noise compensation algorithms. We show that DNA consistently outperforms several existing, related state-of-the-art front-end denoising techniques.
IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.
The performance of speech cleaning and noise adaptation algorithms is heavily dependent on the quality of the noise and channel models. Various strategies have been proposed in the literature for adapting to the current noise and channel conditions. In this paper, we describe the joint learning of noise and channel distortion in a novel framework called ALGONQUIN. The learning algorithm employs a generalized EM strategy wherein the E step is approximate. We discuss the characteristics of the new algorithm, with a focus on convergence rates and parameter initialization. We show that the learning algorithm can successfully disentangle the non-linear effects of noise and the linear effects of the channel, achieving a relative reduction in WER of 21.8% over the non-adaptive algorithm.
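A minimal sketch, assuming the commonly used log-spectrum interaction y ≈ x + h + ln(1 + exp(n - x - h)) between clean speech x, channel h and noise n, of the first-order linearization an approximate E step can use; the function names are illustrative, not ALGONQUIN's implementation.

```python
import numpy as np

def g(x, n, h):
    """Approximate noisy log-spectrum given speech x, noise n and channel h."""
    return x + h + np.log1p(np.exp(n - x - h))

def linearize_g(x0, n0, h0):
    """First-order expansion of g at (x0, n0, h0): y ~ const + a*x + b*h + c*n."""
    s = 1.0 / (1.0 + np.exp(-(n0 - x0 - h0)))     # dg/dn at the expansion point
    const = g(x0, n0, h0) - (1 - s) * (x0 + h0) - s * n0
    return const, 1 - s, 1 - s, s                 # coefficients a, b, c

# Toy check on one frequency bin: the linearization matches g at the expansion point.
x0, n0, h0 = 2.0, 1.0, -0.5
const, a, b, c = linearize_g(x0, n0, h0)
assert np.isclose(const + a * x0 + b * h0 + c * n0, g(x0, n0, h0))
```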
2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698), 2003
A variational inference algorithm for robust speech separation, capable of recovering the underlying speech sources even in the case of more sources than microphone observations, is presented. The algorithm is based upon a generative probabilistic model that fuses time-delay of arrival (TDOA) information with prior information about the speakers and application to produce an optimal estimate of the underlying speech sources. Simulation results are presented for the case of two, three and four underlying sources and two microphone observations corrupted by noise. The resulting SNR gains (24 dB with two sources, 15 dB with three sources, and 9 dB with four sources) are significantly higher than those of previous speech separation techniques.
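An illustrative sketch, not the paper's variational algorithm, of how per-frequency phase differences between two microphones can be turned into TDOA-based soft assignments of time-frequency bins to candidate sources; such a term could then be fused with speaker priors in a generative model. The shapes, delays and width parameter here are assumptions.

```python
import numpy as np

def tdoa_responsibilities(X1, X2, delays, fs, sigma=0.5):
    """X1, X2: STFTs (freq x time) of two microphones; delays: candidate per-source delays (s)."""
    n_freq = X1.shape[0]
    omega = 2 * np.pi * np.arange(n_freq) * fs / (2 * (n_freq - 1))    # angular bin frequencies
    observed_phase = np.angle(X2 * np.conj(X1))                        # inter-mic phase difference
    scores = []
    for tau in delays:
        err = np.angle(np.exp(1j * (observed_phase - omega[:, None] * tau)))  # wrap to [-pi, pi)
        scores.append(np.exp(-err ** 2 / (2 * sigma ** 2)))
    scores = np.stack(scores)                                          # (sources, freq, time)
    return scores / scores.sum(axis=0, keepdims=True)

# Toy usage with random spectra and two hypothetical source delays.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
X2 = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
resp = tdoa_responsibilities(X1, X2, delays=[0.0, 2.5e-4], fs=16000)
```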
We describe a system that can separate and recognize the simultaneous speech of two speakers from a single-channel recording and compare the performance of the system to that of human subjects. The system, which we call Iroquois, uses models of dynamics to achieve performance near that of human listeners. However, the system exhibits a pattern of performance across conditions that is different from that of human subjects. In conditions where the amplitude of the speakers is similar, the Iroquois model surpasses human performance by over 50%. We hypothesize that the system accomplishes this remarkable feat by employing a different strategy from that of the human auditory system.
We present a probabilistic framework that uses a bone sensor and an air microphone to perform speech enhancement for robust speech recognition. The system exploits the advantages of both sensors: the noise resistance of the bone sensor and the linearity of the air microphone. In this paper we describe the general properties of the bone sensor relative to conventional air sensors. We propose a model capable of adapting to the noise conditions, and evaluate performance using a commercial speech recognition system. We demonstrate considerable improvements in recognition, from a baseline of 57% up to nearly 80% word accuracy, for four subjects on a difficult condition with background speaker interference.
Perceiving sounds in a noisy environment is a challenging problem. Visual lip-reading can provide relevant information but is also challenging because lips are moving and a tracker must deal with a variety of conditions. Typically, audio-visual systems have been assembled from individually engineered modules. We propose to fuse audio and video in a probabilistic generative model that implements cross-modal self-supervised learning, enabling adaptation to audio-visual data. The video model features a Gaussian mixture model embedded in a linear subspace of a sprite which translates in the video. The system can learn to detect and enhance speech in noise given only a short (30 second) sequence of audio-visual data. We show some results for speech detection and enhancement, and discuss extensions to the model that are under investigation.
One approach to achieving noise- and distortion-robust speech recognition is to remove noise and distortion with algorithms of low complexity prior to the use of much higher complexity speech recognizers. This approach has been referred to as cleaning. In this paper we present an approach for speech cleaning using a time-varying, non-linear probabilistic model of a signal's log Mel-filter-bank representation. We then present a new non-linear probabilistic inference technique and show results using this technique within the probabilistic cleaning model. In this approach we represent distributions for the underlying noise, speech and channel characteristics as Gaussian mixtures and use Gaussian basis functions to model the non-linear likelihood function. This allows us to efficiently compute complex multi-modal probability distributions over the speech and noise components of the underlying signal. We show how this method can be used to clean speech features and present results using the Aurora...
We consider the task of speech recognition with loud background music interference. We use model-based music-speech separation and train GMM models for the music on the audio prior to the speech. We show over 8% relative improvement in WER at 10 dB SNR for a real-world Voice Search ASR system. We investigate the relationship between ASR accuracy, the amount of music background used as prologue, and the size of the music models. Our study shows that performance peaks when using a music prologue of around 6 seconds to train the music model. We hypothesize that this is due to the dynamic nature of music and the structure of popular music; adding more history beyond a certain point does not improve results. Additionally, we show that moderately sized 8-component music GMM models suffice to model this amount of music prologue. Index Terms: ASR, noise robustness, noise reduction, non-stationary noise, music
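A minimal sketch, with an assumed frame rate and placeholder features, of the prologue idea described above: fit a small GMM on roughly the first 6 seconds of music-only audio and use it as the background model during separation and decoding.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

frame_rate = 100                                   # frames per second (assumed)
prologue_seconds = 6
log_mel = np.random.randn(3000, 40)                # placeholder log-Mel features (frames x dims)

prologue = log_mel[: prologue_seconds * frame_rate]          # music-only history before the query
music_gmm = GaussianMixture(n_components=8, covariance_type="diag")
music_gmm.fit(prologue)

# At recognition time the music GMM supplies per-frame background likelihoods that the
# separation / decoding stage can combine with the speech model.
background_loglik = music_gmm.score_samples(log_mel[prologue_seconds * frame_rate:])
```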
We present a method for separating two speakers from a single microphone channel. The method exploits the fine structure of male and female speech and relies on a strong high-frequency-resolution model for the source signals. The algorithm is able to identify the correct combination of male and female speech that best explains an observation and is able to reconstruct the component signals, relying on prior knowledge to 'fill in' regions that are masked by the other speaker. The two-speaker, single-microphone source separation problem is one of the most challenging source separation scenarios, and few quantitative results have been reported in the literature. We provide a test set based on the Aurora 2 data set and report performance numbers on a portion of this set. We achieve an average increase in SNR of 6.59 dB for female speakers and 5.51 dB for male speakers.
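An illustrative and heavily simplified sketch of this style of model-based separation, assuming per-speaker codebooks of log-spectral means and a max-interaction approximation; masked bins are 'filled in' from the quieter speaker's prior mean. This is not the paper's full algorithm.

```python
import numpy as np

def separate_frame(y, means_a, means_b, sigma=1.0):
    """y: observed mixture log-spectrum (freq,); means_*: speaker codebooks (components, freq)."""
    best, best_score = None, -np.inf
    for ma in means_a:                              # exhaustive search over component pairs
        for mb in means_b:
            pred = np.maximum(ma, mb)               # max-interaction approximation
            score = -np.sum((y - pred) ** 2) / (2 * sigma ** 2)
            if score > best_score:
                best, best_score = (ma, mb), score
    ma, mb = best
    # Dominant bins take the observation; masked bins fall back to the prior mean (capped at y).
    est_a = np.where(ma >= mb, y, np.minimum(ma, y))
    est_b = np.where(mb > ma, y, np.minimum(mb, y))
    return est_a, est_b

# Toy usage with random codebooks standing in for trained speaker models.
rng = np.random.default_rng(1)
means_a = rng.standard_normal((32, 257))
means_b = rng.standard_normal((32, 257))
mixture = np.maximum(means_a[3], means_b[7]) + 0.1 * rng.standard_normal(257)
src_a, src_b = separate_frame(mixture, means_a, means_b)
```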
One approach to robust speech recognition is to use a simple speech model to remove the distortion before applying the speech recognizer. Previous attempts at this approach have relied on unimodal or point estimates of the noise for each utterance. In challenging acoustic environments, e.g., an airport, the spectrum of the noise changes rapidly during an utterance, making a point estimate a poor representation. We show how an iterative form of Laplace's method can be used to estimate the clean speech, using a time-varying probability model of the log-spectra of the clean speech, noise and channel distortion. We use this method, called ALGONQUIN, to denoise speech features and then feed these features into a large-vocabulary speech recognizer whose WER on the clean Wall Street Journal data is 4.9%. When 10 dB of noise consisting of an airplane engine shutting down is added to the data, the recognizer obtains a WER of 28.8%. ALGONQUIN reduces the WER to 12.6%, well below the WER of 2...
Information Extraction methods can be used to automatically "fill in" database forms from unstructured data such as Web documents or email. State-of-the-art methods have achieved low error rates but invariably make a number of errors. The goal of an interactive information extraction system is to assist the user in filling in database fields while giving the user confidence in the integrity of the data. The user is presented with an interactive interface that allows both the rapid verification of automatic field assignments and the correction of errors. In cases where there are multiple errors, our system takes into account user corrections and immediately propagates these constraints such that other fields are often corrected automatically. Linear-chain conditional random fields (CRFs) have been shown to perform well for information extraction and other language modelling tasks due to their ability to capture arbitrary, overlapping features of the input in a Markov model...
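A minimal sketch, using a toy model rather than a trained CRF, of the constraint-propagation step described above: after the user corrects one field, decoding is re-run with that position clamped to the corrected label, which often fixes other fields automatically.

```python
import numpy as np

def constrained_viterbi(emissions, transitions, constraints=None):
    """emissions: (T, L) log scores; transitions: (L, L) log scores; constraints: {position: label}."""
    T, L = emissions.shape
    constraints = constraints or {}

    def clamp(t):
        mask = np.zeros(L)
        if t in constraints:
            mask[:] = -np.inf
            mask[constraints[t]] = 0.0
        return mask

    score = np.full((T, L), -np.inf)
    back = np.zeros((T, L), dtype=int)
    score[0] = emissions[0] + clamp(0)
    for t in range(1, T):
        cand = score[t - 1][:, None] + transitions          # (previous label, current label)
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + emissions[t] + clamp(t)
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy usage: clamping token 1 to label 0 can change the labels of neighbouring tokens too.
emissions = np.log(np.array([[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]]))
transitions = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
print(constrained_viterbi(emissions, transitions))
print(constrained_viterbi(emissions, transitions, constraints={1: 0}))
```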
We present a framework for speech enhancement and robust speech recognition that exploits the harmonic structure of speech. We achieve substantial gains in signal-to-noise ratio (SNR) of enhanced speech as well as considerable gains in accuracy of automatic speech recognition in very noisy conditions. The method exploits the harmonic structure of speech by employing a high-frequency-resolution speech model in the log-spectrum domain and reconstructs the signal from the estimated posteriors of the clean signal and the phases of the original noisy signal. We achieve a gain in signal-to-noise ratio of 8.38 dB for enhancement of speech at 0 dB. We also present recognition results on the Aurora 2 data set. At 0 dB SNR, we achieve a relative word error rate reduction of 43.75% over the baseline, and 15.90% over the equivalent low-resolution algorithm.
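A minimal sketch of the reconstruction step described above, with placeholder signals and a stand-in for the model's estimated clean log-spectrum: the estimated magnitude is combined with the phase of the original noisy signal and the STFT is inverted. Parameter values are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
noisy = np.random.randn(fs * 2)                        # placeholder noisy waveform
f, t, Y = stft(noisy, fs=fs, nperseg=512)              # high frequency resolution STFT

est_log_mag = np.log(np.abs(Y) + 1e-8) - 0.5           # stand-in for the model's posterior estimate
clean_spec = np.exp(est_log_mag) * np.exp(1j * np.angle(Y))   # reuse the noisy phase
_, enhanced = istft(clean_spec, fs=fs, nperseg=512)
```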
This paper introduces the Laplace algorithm for de-noising in the cepstrum domain, with applications to speech recognition. Our method uses Gaussian mixture priors for clean speech and noise cepstra and assumes that speech and noise mix linearly in the spectrum domain. The Laplace algorithm involves two steps: (a) computing the mode of the posterior given the observed noisy cepstra, and (b) forming a Gaussian approximation of the posterior around the mode. We show that the Algonquin algorithm is a special case of our approach in which a Newton method is used for (a). Interestingly, this observation also proves that the Algonquin algorithm does not converge in general. We propose the use of the BFGS method for (a), which also allows us to efficiently apply the Laplace algorithm in the cepstral domain. De-noising in the cepstral domain gives more than a 31% relative reduction in word error rate on average on the Aurora 2 task.
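A minimal sketch of the two steps with a single Gaussian prior in place of the paper's mixtures and a placeholder mixing function; the dimensions, priors and interaction used here are assumptions, not the paper's model.

```python
import numpy as np
from scipy.optimize import minimize

dim = 13                                               # toy cepstral dimensionality
mu_x, var_x = np.zeros(dim), np.ones(dim)              # clean speech prior
mu_n, var_n = -np.ones(dim), 0.5 * np.ones(dim)        # noise prior
y = np.random.randn(dim)                               # observed noisy cepstrum
var_y = 0.1 * np.ones(dim)                             # observation noise variance

def neg_log_posterior(z):
    x, n = z[:dim], z[dim:]
    pred = np.logaddexp(x, n)                          # placeholder for the true mixing function
    return 0.5 * (np.sum((y - pred) ** 2 / var_y)
                  + np.sum((x - mu_x) ** 2 / var_x)
                  + np.sum((n - mu_n) ** 2 / var_n))

# (a) Find the posterior mode with BFGS.
res = minimize(neg_log_posterior, np.concatenate([mu_x, mu_n]), method="BFGS")

# (b) Laplace approximation: mean at the mode, covariance from the inverse Hessian estimate.
posterior_mean = res.x[:dim]                           # clean speech estimate
posterior_cov = res.hess_inv[:dim, :dim]
```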
To address the emerging needs of access to and retrieval of multimedia objects in many applications, we have started a Multimedia Analysis and Retrieval Systems (MARS) project at the University of Illinois. This project addresses three main aspects of Multimedia Information Retrieval, i.e., feature extraction, multimedia object description, and the retrieval algorithm. Although MPEG-7 will concentrate only on multimedia object description, that goal will be better accomplished if its interfaces to feature extraction and the retrieval algorithm are appropriately defined. In this proposal, we first give a brief overview of the MARS system. Then we propose a multimedia object model for MPEG-7's content description interface. The proposed model allows information abstraction at various semantic levels. To better model human perception subjectivity to multimedia data, relevance feedback is integrated into the retrieval process. Our experimental results show that the proposed multimedia object m...
Far-field automatic speech recognition (ASR) is a key enabling technology that allows untethered and natural voice interaction between users and the Amazon Echo family of products. A key component in realizing far-field ASR on these products is the suite of audio front-end (AFE) algorithms that helps mitigate acoustic environmental challenges and thereby improves ASR performance. In this paper, we discuss the key algorithms within the AFE, and we provide insights into how these algorithms help in mitigating the various acoustical challenges of far-field processing. We also provide insights into the audio algorithm architecture adopted for the AFE, and we discuss ongoing and future research.
The Audio Front-End (AFE) is a key component in mitigating acoustic environmental challenges for far-field automatic speech recognition (ASR) on the Amazon Echo family of products. A critical component of the AFE is the Beam Selector, which identifies which beam points to the target user. In this paper, we propose a new SIR beam selector that utilizes subband-based signal-to-interference ratios to learn the locations of the audio sources and thereby further improve beam selection accuracy for the multi-microphone AFE system. We analyze the performance of the Signal-to-Interference Ratio (SIR) beam selector in comparison to a classic beam selector using datasets collected under various conditions. The method is evaluated and shown to simultaneously decrease the word error rate (WER) for speech recognition by up to 46.20% and improve barge-in performance in terms of FRR by up to 39.18%.
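An illustrative sketch, with assumed array shapes and names rather than the production algorithm, of subband-SIR-based beam selection: estimate per-subband signal and interference power for each beam, average the resulting SIRs, and pick the beam with the highest value.

```python
import numpy as np

def select_beam(signal_power, interference_power, eps=1e-10):
    """signal_power, interference_power: (beams, subbands, frames) power estimates."""
    sir_db = 10 * np.log10((signal_power + eps) / (interference_power + eps))
    per_beam_sir = sir_db.mean(axis=(1, 2))            # average over subbands and frames
    return int(per_beam_sir.argmax()), per_beam_sir

# Toy usage with 8 beams, 32 subbands and 50 frames of power estimates.
rng = np.random.default_rng(2)
best_beam, sirs = select_beam(rng.random((8, 32, 50)), rng.random((8, 32, 50)))
```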
We introduce a novel acoustic echo cancellation framework for systems where the loudspeaker and the microphone array are not synchronized. We consider the problem in its most general form, where the loss of synchronization is time-varying. The proposed system is linear and utilizes microphone array beamforming for echo cancellation. It is shown to provide significant improvement over standard echo cancellation and noise suppression techniques in both noise suppression and speech recognition.