EBookClubs

Read Books & Download eBooks Full Online

Book Model-based Speech Enhancement Exploiting Temporal and Spectral Dependencies

Download or read book Model-based Speech Enhancement Exploiting Temporal and Spectral Dependencies written by Thomas Esch and published by . This book was released on 2012 with total page 162 pages. Available in PDF, EPUB and Kindle. Book excerpt:

Book Speech Enhancement with Improved Deep Learning Methods

Download or read book Speech Enhancement with Improved Deep Learning Methods written by Mojtaba Hasannezhad and published by . This book was released on 2021 with total page 0 pages. Available in PDF, EPUB and Kindle. Book excerpt: In real-world environments, speech signals are often corrupted by ambient noise during acquisition, degrading the quality and intelligibility of the speech for a listener. As one of the central topics in speech processing, speech enhancement aims to recover clean speech from such a noisy mixture. Many traditional speech enhancement methods based on statistical signal processing have been proposed and widely used in the past. However, their performance is limited, and they fail in sophisticated acoustic scenarios. Over the last decade, deep learning, as a primary tool for developing data-driven information systems, has led to revolutionary advances in speech enhancement. In this context, speech enhancement is treated as a supervised learning problem, which does not suffer from the issues faced by traditional methods. This supervised learning problem has three main components: input features, learning machine, and training target. In this thesis, various deep learning architectures and methods are developed to address the current limitations of these three components. First, we propose a serial hybrid neural network model integrating a new low-complexity fully convolutional neural network (CNN) and a long short-term memory (LSTM) network to estimate a phase-sensitive mask for speech enhancement. Instead of using traditional acoustic features as the input of the model, a CNN is employed to automatically extract sophisticated speech features that can maximize the performance of the model. An LSTM network is then chosen as the learning machine to model the strong temporal dynamics of speech. The model is designed to take full advantage of the temporal dependencies and spectral correlations present in the input speech signal while keeping the model complexity low. In addition, an attention mechanism is embedded to adaptively recalibrate the useful CNN-extracted features. Through extensive comparative experiments, we show that the proposed model significantly outperforms some known neural-network-based speech enhancement methods in the presence of highly non-stationary noises, while exhibiting a relatively small number of model parameters compared to some commonly employed deep neural network (DNN)-based methods. Most of the available approaches to speech enhancement using deep neural networks face a number of limitations: they do not exploit the information contained in the phase spectrum, while their high computational complexity and memory requirements make them unsuitable for real-time applications. Hence, a new phase-aware composite deep neural network (PACDNN) is proposed to address these challenges. Specifically, magnitude processing with a spectral mask and phase reconstruction using phase derivatives are proposed as key subtasks of the new network, enhancing the magnitude and phase spectra simultaneously. Moreover, the network is carefully designed to take advantage of the strong temporal and spectral dependencies of speech, while its components operate independently and in parallel to speed up computation. The advantages of the proposed PACDNN model over some well-known DNN-based speech enhancement methods are demonstrated through extensive comparative experiments.
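
As an illustration of the CNN + LSTM mask-estimation idea described in this abstract, here is a minimal PyTorch sketch. All layer sizes, kernel widths, and names are illustrative assumptions, not the thesis's actual architecture.

```python
# Minimal sketch of a CNN + LSTM phase-sensitive mask estimator (PyTorch).
# Layer sizes and names are illustrative assumptions only.
import torch
import torch.nn as nn

class CnnLstmMaskEstimator(nn.Module):
    def __init__(self, n_freq=161, cnn_channels=32, lstm_hidden=128):
        super().__init__()
        # CNN front-end: learns features from the noisy magnitude spectrogram
        # instead of relying on hand-crafted acoustic features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, cnn_channels, kernel_size=(3, 3), padding=1),
            nn.ReLU(),
            nn.Conv2d(cnn_channels, cnn_channels, kernel_size=(3, 3), padding=1),
            nn.ReLU(),
        )
        # LSTM models the temporal dynamics of speech across frames.
        self.lstm = nn.LSTM(cnn_channels * n_freq, lstm_hidden, batch_first=True)
        # Linear layer maps the LSTM state to one mask value per frequency bin.
        self.proj = nn.Linear(lstm_hidden, n_freq)

    def forward(self, noisy_mag):            # (batch, frames, n_freq)
        x = noisy_mag.unsqueeze(1)           # (batch, 1, frames, n_freq)
        x = self.cnn(x)                      # (batch, C, frames, n_freq)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.lstm(x)
        # Sigmoid keeps the estimated mask in [0, 1].
        return torch.sigmoid(self.proj(x))   # (batch, frames, n_freq)

# The phase-sensitive mask target is commonly |S|/|Y| * cos(theta_S - theta_Y),
# clipped to [0, 1] before computing an MSE loss against the network output.
```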
Considering that some acoustic scenarios could be better handled by a number of low-complexity sub-DNNs, each specifically designed to perform a particular task, we propose another very low-complexity, fully convolutional framework that performs speech enhancement in the short-time modified discrete cosine transform (STMDCT) domain. This framework comprises two main stages: classification and mapping. In the former, a CNN-based network classifies the input speech based on its utterance-level attributes, i.e., signal-to-noise ratio and gender. In the latter, four well-trained CNNs, each specialized for a specific and simple task, map the STMDCT of the noisy input speech to that of clean speech. Since this framework operates in the STMDCT domain, there is no need to deal with phase information, i.e., no phase-related computation is required. Moreover, the training target is only half the length of those in the previous chapters, leading to lower computational complexity and lighter demands on the mapping CNNs. Although the model contains multiple branches, only one of the expert CNNs is active at a time, i.e., the computational burden is confined to a single branch at any time. The mapping CNNs are also fully convolutional, and their computations are performed in parallel, further reducing the computation time. Moreover, the proposed framework reduces latency by 55% compared to the models in the previous chapters. Through extensive experimental studies, it is shown that the MBSE framework not only delivers superior speech enhancement performance but also has lower complexity than some existing deep-learning-based methods.
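
The classify-then-map idea above can be sketched as a hard-routed mixture of experts: a small classifier picks one expert network per utterance, so only one branch ever runs. The module definitions below are illustrative placeholders, not the thesis's networks.

```python
# Sketch of classify-then-map routing: a classifier selects one of four
# expert mapping networks, so only a single branch is active at a time.
# All module sizes and names here are illustrative assumptions.
import torch
import torch.nn as nn

class ExpertRouter(nn.Module):
    def __init__(self, n_features=128, n_experts=4):
        super().__init__()
        # Classifier predicts utterance-level attributes (e.g., SNR range,
        # gender) and thereby selects one expert branch.
        self.classifier = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_experts)
        )
        # Each expert maps noisy STMDCT features toward clean ones.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(n_features, 256), nn.ReLU(),
                          nn.Linear(256, n_features))
            for _ in range(n_experts)
        ])

    def forward(self, frames):                # (n_frames, n_features)
        # Utterance-level decision from the mean feature vector.
        logits = self.classifier(frames.mean(dim=0))
        branch = int(logits.argmax())         # hard routing: one expert only
        return self.experts[branch](frames)
```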

Book Speech Enhancement Exploiting the Source-filter Model

Download or read book Speech Enhancement Exploiting the Source-filter Model written by Samy Elshamy and published by . This book was released on 2020 with total page 0 pages. Available in PDF, EPUB and Kindle. Book excerpt: Everyday life without mobile telephony is nowadays hard to imagine. Calls are made in every conceivable situation and environment, so the microphone picks up not only the user's speech but also sound from the surroundings, which is likely to impede the conversational partner's understanding. Modern speech enhancement systems are able to mitigate such effects, and most users are not even aware of their existence. This thesis presents the development of a modern single-channel speech enhancement approach that uses the divide-and-conquer principle to combat environmental noise in microphone signals. Though initially motivated by mobile telephony applications, the approach can be applied whenever speech is to be retrieved from a corrupted signal. It uses the so-called source-filter model to divide the problem into two subproblems, which are then conquered by enhancing the source (the excitation signal) and the filter (the spectral envelope) separately. Both enhanced signals are then used to denoise the corrupted signal. Spectral envelope estimation has a long history, and some approaches already exist for speech enhancement; however, they typically neglect the excitation signal and are therefore unable to enhance the spectral fine structure properly. Both individual enhancement approaches exploit benefits of the cepstral domain, which offers, e.g., advantageous mathematical properties and straightforward synthesis of excitation-like signals. We investigate traditional model-based schemes like Gaussian mixture models (GMMs), classical signal-processing-based approaches, as well as modern deep neural network (DNN)-based approaches in this thesis. The enhanced signals are not used directly to enhance the corrupted signal (e.g., to synthesize a clean speech signal) but serve as a so-called a priori signal-to-noise ratio (SNR) estimate in a traditional statistical speech enhancement system. Such a system consists of a noise power estimator, an a priori SNR estimator, and a spectral weighting rule that is driven by the results of the aforementioned estimators and employed to retrieve the clean speech estimate from the noisy observation. As a result, the new approach attains significantly higher noise attenuation than current state-of-the-art systems while maintaining comparable speech component quality and speech intelligibility. In consequence, the overall quality of the enhanced speech signal turns out to be superior to that of state-of-the-art speech enhancement approaches.
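
The traditional statistical back-end this thesis plugs into is well established. Below is a minimal NumPy sketch of the conventional decision-directed a priori SNR estimator with a Wiener weighting rule; the thesis replaces the a priori SNR with its source-filter-based estimate, which is not reproduced here. Constants and shapes are illustrative.

```python
# Conventional decision-directed a priori SNR estimation with a Wiener gain.
# The smoothing constant and SNR floor are typical illustrative values.
import numpy as np

def enhance_frames(noisy_mag, noise_psd, alpha=0.98, xi_min=10 ** (-15 / 10)):
    """noisy_mag: (frames, bins) STFT magnitudes; noise_psd: (bins,) estimate."""
    enhanced = np.empty_like(noisy_mag)
    prev_clean_psd = np.zeros(noisy_mag.shape[1])
    for t in range(noisy_mag.shape[0]):
        gamma = noisy_mag[t] ** 2 / noise_psd            # a posteriori SNR
        # Decision-directed smoothing between the previous clean-speech
        # estimate and the instantaneous SNR (Ephraim-Malah style).
        xi = alpha * prev_clean_psd / noise_psd + (1 - alpha) * np.maximum(gamma - 1, 0)
        xi = np.maximum(xi, xi_min)                      # floor avoids musical noise
        gain = xi / (1 + xi)                             # Wiener weighting rule
        enhanced[t] = gain * noisy_mag[t]
        prev_clean_psd = enhanced[t] ** 2
    return enhanced
```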

Book Audio Source Separation and Speech Enhancement

Download or read book Audio Source Separation and Speech Enhancement written by Emmanuel Vincent and published by John Wiley & Sons. This book was released on 2018-10-22 with total page 517 pages. Available in PDF, EPUB and Kindle. Book excerpt: Learn the technology behind hearing aids, Siri, and Echo. Audio source separation and speech enhancement aim to extract one or more source signals of interest from an audio recording involving several sound sources. These technologies are among the most studied in audio signal processing today and play a critical role in the success of hearing aids, hands-free phones, voice command and other noise-robust audio analysis systems, and music post-production software. Research on this topic has followed three convergent paths, rooted respectively in sensor array processing, computational auditory scene analysis, and machine-learning approaches such as independent component analysis. This book is the first to provide a comprehensive overview by presenting the common foundations of, and the differences between, these techniques in a unified setting. Key features: a consolidated perspective on audio source separation and speech enhancement; both the historical perspective and the latest advances in the field, e.g., deep neural networks; diverse disciplines, including array processing, machine learning, and statistical signal processing; coverage of the most important techniques for both single-channel and multichannel processing. The book provides both introductory and advanced material suitable for readers with basic knowledge of signal processing and machine learning. Thanks to its comprehensiveness, it will help students select a promising research track, researchers leverage cross-domain knowledge to design improved techniques, and engineers and developers choose the right technology for their target application. It will also be useful for practitioners from other fields (e.g., acoustics, multimedia, phonetics, and musicology) who wish to exploit audio source separation or speech enhancement as pre-processing tools for their own needs.

Book Speech Processing in Mobile Environments

Download or read book Speech Processing in Mobile Environments written by K. Sreenivasa Rao and published by Springer Science & Business Media. This book was released on 2014-01-28 with total page 129 pages. Available in PDF, EPUB and Kindle. Book excerpt: This book focuses on speech processing in the presence of low-bit-rate coding and varying background environments. The methods presented exploit speech events that remain robust in noisy environments. Accurate estimation of these crucial events is useful for carrying out various speech tasks such as speech recognition, speaker recognition, and speech rate modification in mobile environments. The authors provide insights into designing and developing robust methods for processing speech in mobile environments, covering temporal and spectral enhancement methods that minimize the effect of noise, and examining methods and models for speech and speaker recognition applications in mobile environments.

Book Speech Enhancement Methods Based on CASA Incorporating Spectral Correlation

Download or read book Speech Enhancement Methods Based on CASA Incorporating Spectral Correlation written by Feng Bao and published by . This book was released on 2018 with total page 141 pages. Available in PDF, EPUB and Kindle. Book excerpt: Computational auditory scene analysis (CASA) has shown great potential for speech enhancement compared to some statistical model-based methods. A key challenge for CASA is how to effectively estimate the binary mask or ratio mask in each time-frequency (T-F) unit. In this thesis, four speech enhancement methods with binary or ratio mask estimation are proposed based on the spectral relationship among noisy speech, pure noise, and clean speech. The common use of fixed thresholds in conventional CASA methods constrains segregation and T-F unit labeling, degrading denoising performance. Thus, an adaptive factor is first derived from the power spectra of the noisy speech and the estimated noise to replace those fixed thresholds. As a result, noise reduction is achieved with improved pitch contours and T-F unit labeling. A new binary mask estimation method is then proposed based on convex optimization to reduce the artifacts and temporal discontinuity caused by inaccurate binary mask estimation. Signal segregation and pitch estimation are not needed in this method; only speech power is used as the key cue for labeling the binary mask. The cross-correlation between the noisy speech and estimated noise power spectra in each channel is employed to build the objective function. The T-F units of speech and noise are labeled with a decision factor derived from the powers of the noisy speech, the estimated speech, and the pre-estimated noise. Erroneous local masks are refined by T-F unit smoothing. As a consequence, noise is effectively reduced and the perceptual quality of the enhanced speech is improved. A new Wiener-filtering-based ratio mask estimation method is proposed to further increase the temporal continuity of the reconstructed speech. In this method, the speech power of each T-F unit is obtained by convex optimization, with an objective function that again depends on the cross-correlation between the noisy speech and estimated noise power spectra. To improve estimation accuracy, the estimated ratio mask is further modified based on its adjacent T-F units and then smoothed by interpolation with the estimated binary masks. The results confirm improvements in noise reduction, speech quality, and speech intelligibility. Finally, a novel ratio mask representation exploiting the inter-channel correlation (ICC) among the noisy speech, pure noise, and clean speech spectra is proposed to further improve enhancement performance. In this way, the power ratio of speech and noise is reallocated adaptively during construction of the ratio mask, so that more speech components are retained and more noise components are masked. A channel-weight contour based on the equal-loudness hearing attribute is adopted to revise the ratio mask in each T-F unit. The developed ratio mask is used, together with other features, to train a five-layer deep neural network (DNN). Experiments show significant improvements in speech quality and intelligibility compared to DNN-based methods without ICC.
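
For readers unfamiliar with the two mask targets the thesis estimates, the following NumPy sketch computes their oracle (clean-reference) versions per T-F unit. Shapes and the 0 dB local criterion are illustrative assumptions.

```python
# Oracle binary and ratio masks per time-frequency unit. The thesis estimates
# these blindly; here they are computed from reference powers for clarity.
import numpy as np

def oracle_masks(speech_psd, noise_psd, lc_db=0.0, eps=1e-12):
    """speech_psd, noise_psd: (frames, channels) T-F power estimates."""
    local_snr_db = 10 * np.log10((speech_psd + eps) / (noise_psd + eps))
    ibm = (local_snr_db > lc_db).astype(float)          # ideal binary mask
    irm = speech_psd / (speech_psd + noise_psd + eps)   # Wiener-like ratio mask
    return ibm, irm
```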

Book DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement

Download or read book DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement written by Richard C. Hendriks and published by Springer Nature. This book was released on 2022-05-31 with total page 70 pages. Available in PDF, EPUB and Kindle. Book excerpt: As speech processing devices like mobile phones, voice-controlled devices, and hearing aids have increased in popularity, people expect them to work anywhere and at any time without user intervention. However, the presence of acoustical disturbances limits the use of these applications, degrades their performance, or causes the user difficulty in understanding the conversation or appreciating the device. A common way to reduce the effects of such disturbances is through the use of single-microphone noise reduction algorithms for speech enhancement. The field of single-microphone noise reduction for speech enhancement comprises more than 30 years of research. In this survey, we demonstrate the significant advances that have been made during the last decade in the field of discrete Fourier transform domain-based single-channel noise reduction for speech enhancement. Furthermore, our goal is to provide a concise description of a state-of-the-art speech enhancement system and to demonstrate the relative importance of its various building blocks. This allows the non-expert DSP practitioner to judge the relevance of each building block and to implement a close-to-optimal enhancement system for the particular application at hand. Table of Contents: Introduction / Single Channel Speech Enhancement: General Principles / DFT-Based Speech Enhancement Methods: Signal Model and Notation / Speech DFT Estimators / Speech Presence Probability Estimation / Noise PSD Estimation / Speech PSD Estimation / Performance Evaluation Methods / Simulation Experiments with Single-Channel Enhancement Systems / Future Directions
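
One of the building blocks the survey covers, noise PSD estimation, can be sketched with a toy minimum-tracking estimator in NumPy. The window length and smoothing constant are illustrative, and the bias compensation of a full minimum-statistics estimator is omitted.

```python
# Toy minimum-tracking noise PSD estimator: speech is sparse enough in time
# that the minimum of a smoothed periodogram exposes the noise floor.
# Constants are illustrative, and bias compensation is omitted.
import numpy as np

def noise_psd_min_tracking(noisy_psd, win=96, smooth=0.9):
    """noisy_psd: (frames, bins) periodogram of the noisy signal."""
    smoothed = np.empty_like(noisy_psd)
    acc = noisy_psd[0].copy()
    for t in range(noisy_psd.shape[0]):
        # Recursive smoothing of the noisy periodogram.
        acc = smooth * acc + (1 - smooth) * noisy_psd[t]
        smoothed[t] = acc
    noise = np.empty_like(smoothed)
    for t in range(smoothed.shape[0]):
        # Noise PSD ~= minimum of the smoothed periodogram over `win` frames.
        noise[t] = smoothed[max(0, t - win + 1): t + 1].min(axis=0)
    return noise
```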

Book Robust Speech Enhancement in the Time Domain

Download or read book Robust Speech Enhancement in the Time Domain written by Ashutosh Pandey and published by . This book was released on 2022 with total page 0 pages. Available in PDF, EPUB and Kindle. Book excerpt: Speech is the primary mode of human communication and a natural interface for human-machine interaction. However, background noise in the real world creates difficulty for both human and machine listeners. Speech enhancement aims at removing or attenuating background noise from degraded speech. In contrast to the widely accepted time-frequency (T-F) based methods, time-domain speech enhancement estimates the clean speech samples directly from the noisy speech samples. Time-domain speech enhancement using deep neural networks (DNNs) is an exciting research direction due to its potential for jointly enhancing the spectral magnitude and phase by utilizing the strong modeling capabilities of DNNs. This dissertation presents a systematic effort to develop monaural time-domain speech enhancement systems using DNNs. We start by developing a novel framework for time-domain speech enhancement. It includes a convolutional neural network (CNN) for time-domain enhancement and a spectral-magnitude-based loss for supervised training. CNNs are well suited to learning representations from raw waveforms by utilizing local correlations, while the loss over spectral magnitudes aids supervised learning in recognizing discriminative patterns of speech and noise over different frequency bands. The proposed framework significantly outperforms a strong T-F based gated residual network (GRN) model for spectral magnitude enhancement. Many real-world applications, such as hearing aids and teleconferencing, require real-time speech enhancement. Next, we develop a real-time speech enhancement system called TCNN (Temporal Convolutional Neural Network), a novel utterance-based CNN that utilizes causal and dilated temporal convolutions. Causal convolutions are crucial for low algorithmic latency, and dilated convolutions help with long-range context aggregation. TCNN is shown to outperform T-F based baseline models, including a unidirectional long short-term memory (LSTM) model and a convolutional recurrent network (CRN) model. Further, we advance time-domain enhancement by improving both the CNN architecture and the training loss. The architecture is improved by adding densely connected blocks and using self-attention for better context aggregation than dilated convolutions. We propose a new training loss called the phase-constrained magnitude (PCM) loss, which is measured not only on the enhanced speech but also on the removed noise. It improves phase enhancement and, as a result, obtains better SNR improvements and removes an undesired artifact introduced by the spectral magnitude loss. Next, we systematically investigate the cross-corpus generalization of DNN-based speech enhancement. We observe that DNNs suffer from a corpus fitting problem, where a DNN trained on one corpus fails to generalize to other corpora. We propose several techniques, such as channel normalization, a smaller frame shift, and a more comprehensive training corpus, to improve cross-corpus generalization. To further elevate cross-corpus generalization, we propose a novel attentive recurrent network (ARN) for time-domain speech enhancement. The key aspects of ARN include an RNN, self-attention, a smaller frame shift, and a larger training corpus. ARN exhibits superior speech enhancement in multiple tasks.
The causal version of ARN is the first system trained in a speaker-, noise-, and corpus-independent way that exhibits substantial intelligibility improvements for both normal-hearing and hearing-impaired listeners in low-SNR conditions. Finally, we propose a novel training framework called attentive training for supervised speech enhancement that can remove not only noise but also interfering speech. The main idea of attentive training is to attend to the stream of a single speaker in the mixture, amid the speech signals of other talkers and background noise. A DNN model is trained to sequentially attend to (extract) the speech of the first speaker and ignore the rest, where the onset time of the first speaker serves as an intrinsic mechanism for speaker selection. Attentive training outperforms the widely used permutation-invariant training for speaker separation.
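
The causal dilated convolutions that give TCNN its low latency and long receptive field are a standard construction. Below is a minimal PyTorch sketch of one such block; channel counts and dilations are illustrative, and this is not the dissertation's full encoder-decoder.

```python
# Causal dilated 1-D convolution of the kind TCNN-style models stack.
# Left-only padding guarantees the output at time t sees no future samples.
import torch
import torch.nn as nn

class CausalDilatedConv(nn.Module):
    def __init__(self, channels=64, kernel_size=3, dilation=1):
        super().__init__()
        # Pad only on the left so the convolution remains strictly causal.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                     # (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))
        return self.conv(x)

# Stacking blocks with dilations 1, 2, 4, 8, ... grows the receptive field
# exponentially while keeping every output frame causal.
stack = nn.Sequential(*(CausalDilatedConv(dilation=d) for d in (1, 2, 4, 8)))
```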

Book Speech and Computer

    Book Details:
  • Author : Alexey Karpov
  • Publisher : Springer Nature
  • Release : 2023-12-23
  • ISBN : 303148309X
  • Pages : 657 pages

Download or read book Speech and Computer written by Alexey Karpov and published by Springer Nature. This book was released on 2023-12-23 with total page 657 pages. Available in PDF, EPUB and Kindle. Book excerpt: The two-volume proceedings set LNAI 14338 and 14339 constitutes the refereed proceedings of the 25th International Conference on Speech and Computer, SPECOM 2023, held in Dharwad, India, during November 29–December 2, 2023. The 94 papers included in these proceedings were carefully reviewed and selected from 174 submissions. They focus on all aspects of speech science and technology: automatic speech recognition; computational paralinguistics; digital signal processing; speech prosody; natural language processing; child speech processing; speech processing for medicine; industrial speech and language technology; speech technology for under-resourced languages; speech analysis and synthesis; speaker and language identification, verification and diarization.

Book Audio Source Separation

Download or read book Audio Source Separation written by Shoji Makino and published by Springer. This book was released on 2018-03-01 with total page 389 pages. Available in PDF, EPUB and Kindle. Book excerpt: This book provides the first comprehensive overview of the fascinating topic of audio source separation based on non-negative matrix factorization, deep neural networks, and sparse component analysis. The first section of the book covers single channel source separation based on non-negative matrix factorization (NMF). After an introduction to the technique, two further chapters describe separation of known sources using non-negative spectrogram factorization, and temporal NMF models. In section two, NMF methods are extended to multi-channel source separation. Section three introduces deep neural network (DNN) techniques, with chapters on multichannel and single channel separation, and a further chapter on DNN based mask estimation for monaural speech separation. In section four, sparse component analysis (SCA) is discussed, with chapters on source separation using audio directional statistics modelling, multi-microphone MMSE-based techniques and diffusion map methods. The book brings together leading researchers to provide tutorial-like and in-depth treatments on major audio source separation topics, with the objective of becoming the definitive source for a comprehensive, authoritative, and accessible treatment. This book is written for graduate students and researchers who are interested in audio source separation techniques based on NMF, DNN and SCA.
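
The NMF building block the book's first section develops is compact enough to sketch directly. Below is a minimal NumPy implementation of the classic multiplicative updates under Euclidean distance; the rank and iteration count are illustrative.

```python
# Multiplicative-update NMF on a nonnegative magnitude spectrogram V,
# factoring V ~= W @ H (spectral bases times activations).
import numpy as np

def nmf(V, rank=8, n_iter=200, eps=1e-9):
    """V: (freq, time) nonnegative spectrogram; returns factors W, H."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update spectral bases
    return W, H

# Separation then reassembles each source from its subset of basis columns,
# typically through a Wiener-like mask (W_k @ H_k) / (W @ H).
```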

Book New Era for Robust Speech Recognition

Download or read book New Era for Robust Speech Recognition written by Shinji Watanabe and published by Springer. This book was released on 2017-11-10 with total page 0 pages. Available in PDF, EPUB and Kindle. Book excerpt: This book covers the state-of-the-art in deep neural-network-based methods for noise robustness in distant speech recognition applications. It provides insights and detailed descriptions of some of the new concepts and key technologies in the field, including novel architectures for speech enhancement, microphone arrays, robust features, acoustic model adaptation, training data augmentation, and training criteria. The contributed chapters also include descriptions of real-world applications, benchmark tools and datasets widely used in the field. This book is intended for researchers and practitioners working in the field of speech processing and recognition who are interested in the latest deep learning techniques for noise robustness. It will also be of interest to graduate students in electrical engineering or computer science, who will find it a useful guide to this field of research.

Book Spectral Refinements to Speech Enhancement

Download or read book Spectral Refinements to Speech Enhancement written by Werayuth Charoenruengkit and published by . This book was released on 2009 with total page 248 pages. Available in PDF, EPUB and Kindle. Book excerpt: The goal of a speech enhancement algorithm is to remove noise and recover the original signal with as little distortion and residual noise as possible. Most successful real-time algorithms operate in the frequency domain, where the spectral amplitude of clean speech is estimated for each short-time frame of the noisy signal. State-of-the-art short-time spectral amplitude estimators estimate the clean spectral amplitude in terms of the power spectral density (PSD) of the noisy signal. The PSD should be computed from a large ensemble of signal realizations, yet in practice it can only be estimated from a finite-length sample of a single realization. Estimation errors introduced by these limitations drive the solution away from the optimum. Various spectral estimation techniques, many with added spectral smoothing, have been investigated for decades to reduce these estimation errors, but such algorithms do not significantly address the quality of the speech as perceived by a human listener. This dissertation presents analysis and techniques that offer spectral refinements for speech enhancement. We present an analytical framework for the effect of spectral estimate variance on speech enhancement performance, using the variance quality factor (VQF) as a quantitative measure of estimated spectra. We show that reducing the spectral estimator VQF significantly reduces the VQF of the enhanced speech. The autoregressive multitaper (ARMT) spectral estimate is proposed as a low-VQF spectral estimator for use in speech enhancement algorithms. An innovative method of incorporating a speech production model using multiband excitation is also presented as a technique to emphasize the harmonic components of the glottal speech input. Preconditioning the noisy estimates by exploiting other sources of information, such as pitch estimation and the speech production model, effectively increases the localized narrow-band signal-to-noise ratio (SNR) of the noisy signal, which is subsequently denoised by the amplitude gain. Combined with voicing structure enhancement, the ARMT spectral estimate delivers enhanced speech with a clarity desirable to human listeners. The resulting improvements in the enhanced speech are significant under both objective and subjective measures.
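
The variance reduction at the heart of this work comes from multitaper spectral estimation: averaging several eigenspectra lowers the estimator's variance. Here is a basic multitaper PSD estimate with DPSS (Slepian) tapers via SciPy; the autoregressive smoothing step of the full ARMT estimator is omitted, and the taper count is an illustrative choice.

```python
# Basic multitaper PSD estimate: average periodograms computed with
# orthogonal DPSS tapers to reduce estimator variance.
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(frame, n_tapers=5, nw=3.0):
    """frame: 1-D raw speech frame; the DPSS tapers act as analysis windows."""
    tapers = dpss(len(frame), nw, n_tapers)          # (n_tapers, len(frame))
    # One eigenspectrum per taper, then average across tapers.
    spectra = np.abs(np.fft.rfft(tapers * frame[None, :], axis=1)) ** 2
    return spectra.mean(axis=0)
```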

Book Exploiting Pitch Dynamics for Speech Spectral Estimation Using a Two-dimensional Processing Framework

Download or read book Exploiting Pitch Dynamics for Speech Spectral Estimation Using a Two-dimensional Processing Framework written by Tianyu Tom Wang and published by . This book was released on 2008 with total page 135 pages. Available in PDF, EPUB and Kindle. Book excerpt: This thesis addresses the problem of obtaining an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency. Our work is inspired by auditory perception and physiological modeling studies implicating the use of temporal changes in speech by humans. Specifically, we develop and evaluate signal processing schemes that exploit the temporal change of pitch as a basis for high-pitch formant estimation. As part of our development, we assess the source-filter separation capabilities of several two-dimensional processing schemes that utilize both standard spectrographic and auditory-based time-frequency representations. Our methods show quantitative improvements under certain conditions over representations derived from traditional and homomorphic linear prediction. We conclude by highlighting potential benefits of our framework for the particular application of speaker recognition, with preliminary results indicating closure of a performance gender gap on subsets of the TIMIT corpus.
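
One simple instance of two-dimensional processing of a spectrogram is a 2-D Fourier transform of a localized time-frequency patch, in which slowly varying formant (envelope) structure and fast harmonic fine structure land in different regions. The sketch below illustrates only this generic idea; the patch size, STFT settings, and function name are assumptions, not the thesis's specific schemes.

```python
# 2-D transform of a localized spectrogram patch: envelope energy
# concentrates near the origin of the 2-D spectrum, while harmonic fine
# structure (which moves with pitch) appears at higher 2-D frequencies.
import numpy as np
from scipy.signal import stft

def patch_2d_transform(x, fs, patch_frames=32, patch_bins=64):
    """x: 1-D speech signal; returns the 2-D magnitude spectrum of a patch."""
    _, _, Z = stft(x, fs=fs, nperseg=512, noverlap=448)
    log_mag = np.log(np.abs(Z) + 1e-9)
    patch = log_mag[:patch_bins, :patch_frames]   # local time-frequency region
    return np.fft.fftshift(np.abs(np.fft.fft2(patch)))
```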

Book Speech Enhancement for Non-stationary Noise Based on Spectral Processing

Download or read book Speech Enhancement for Non-stationary Noise Based on Spectral Processing written by Mads Helle and published by . This book was released on an unlisted date with an unlisted page count. Available in PDF, EPUB and Kindle. Book excerpt:

Book Speech Enhancement in the STFT Domain

Download or read book Speech Enhancement in the STFT Domain written by Jacob Benesty and published by Springer Science & Business Media. This book was released on 2011-09-18 with total page 112 pages. Available in PDF, EPUB and Kindle. Book excerpt: This work addresses the speech enhancement problem in the short-time Fourier transform (STFT) domain. We divide the general problem into five basic categories depending on the number of microphones being used and on whether the interframe or interband correlation is considered. The first category deals with the single-channel problem, where STFT coefficients at different frames and frequency bands are assumed to be independent. In this case, the noise reduction filter in each frequency band is basically a real gain. Since a gain does not improve the signal-to-noise ratio (SNR) for any given subband and frame, noise reduction is basically achieved by liftering the subbands and frames that are less noisy while weighing down those that are more noisy. The second category also concerns the single-channel problem; the difference is that the interframe correlation is now taken into account and a filter, rather than just a gain, is applied in each subband. The advantage of using the interframe correlation is that we can improve not only the long-time fullband SNR but also the frame-wise subband SNR. The third and fourth categories address multichannel noise reduction in the STFT domain with and without interframe correlation, respectively. In the last category, we consider the interband correlation in the design of the noise reduction filters. We illustrate the basic principle for the single-channel case as an example, although the concept generalizes to the other scenarios. In all categories, we propose different optimization cost functions from which we derive the optimal filters, and we define performance measures that help in analyzing them.
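
The first category above, a real gain per subband and frame, is easy to demonstrate end to end with SciPy's STFT analysis/synthesis. The Wiener-style gain rule below is a placeholder illustration, not one of the book's derived optimal filters.

```python
# First-category processing: apply a real gain to each STFT bin and frame,
# then resynthesize by overlap-add. The gain rule here is a simple
# illustrative Wiener-style weighting.
import numpy as np
from scipy.signal import stft, istft

def apply_subband_gains(noisy, noise_psd, fs, eps=1e-12):
    """noisy: 1-D signal; noise_psd: per-bin noise power, shape (nperseg//2+1,)."""
    f, t, Y = stft(noisy, fs=fs, nperseg=512)
    # Instantaneous a priori SNR estimate per subband and frame.
    snr = np.maximum(np.abs(Y) ** 2 / (noise_psd[:, None] + eps) - 1, 0)
    gain = snr / (1 + snr)            # real gain in [0, 1] per T-F bin
    _, enhanced = istft(gain * Y, fs=fs, nperseg=512)
    return enhanced
```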

Book Speech Enhancement Using Microphone Arrays

Download or read book Speech Enhancement Using Microphone Arrays written by Siow Yong Low and published by . This book was released on 2005 with total page 396 pages. Available in PDF, EPUB and Kindle. Book excerpt: To further improve suppression capability, spectral-domain processing is also exploited to promote noise and echo suppression, as it compensates for the temporal linear filtering in the presence of non-linearities. Results in a real duplex hands-free situation verify the noise and echo suppression capability of the proposed spatio-temporal-spectral processor in both noisy non-double-talk and double-talk scenarios, even with only two elements. The proposed structure maintains its "blind" attributes yet achieves the suppression capability of a beamformer. All in all, this thesis unifies the best of the three domains (spatial, temporal, and spectral) under one roof to synergistically enhance speech signals in adverse environments.