System ID | U0020-2507201421495400 |
---|---|
Title (Chinese) | 門檻值去噪法於調變頻譜之強健性語音辨識研究 |
Title (English) | A Study of Threshold Denoising on Modulation Spectrum for Robust Speech Recognition |
University | National Chi Nan University |
Department (Chinese) | 電機工程學系 |
Department (English) | Electrical Engineering |
Academic Year | 102 |
Semester | 2 |
Publication Year | 103 (ROC calendar; 2014) |
Student (Chinese) | 程彥誌 |
Student (English) | Yen-Chih Cheng |
Student ID | 101323558 |
Degree | Master's |
Language | Traditional Chinese |
Oral Defense Date | 2014-06-25 |
Number of Pages | 75 |
Oral Defense Committee | Member: 莊家峰 |
Advisor | 吳俊德 |
Co-advisor | 洪志偉 |
Keywords (Chinese) | 強健性語音辨識; 門檻值去噪; 離散餘弦轉換; 離散傅立葉轉換; 調變頻譜 |
Keywords (English) | robust speech recognition; threshold denoising; discrete cosine transform; discrete Fourier transform; modulation spectrum |
Subject Classification | Engineering |
Chinese Abstract |
This thesis proposes a new noise-robustness technique to improve the recognition accuracy of speech features in noisy environments. In the proposed algorithm, the temporal-domain feature sequence is transformed to its frequency domain via the DCT or the DFT, the smaller components are then removed by thresholding, and the features are finally converted from the modulation spectrum back to the temporal domain to obtain the new features. The method has two advantages: first, the entire compensation process is unsupervised and requires no extra information about the noise; second, the threshold setting is very flexible, with more than one choice available. Experiments on the internationally used Aurora-2 connected-digit corpus show that the proposed method brings significant recognition improvements on top of any of the statistics-normalization pre-processing methods, such as CMVN, MVA, CGN, and HEQ. The DFT-based results are generally better than the DCT-based ones, but we further find that, with the DCT-based method, compensating only the low-frequency portion achieves performance similar to or even better than compensating the full band. Both the DCT- and DFT-based methods are therefore of considerable practical value. |
English Abstract |
This thesis presents a novel noise-robustness algorithm to enhance speech features for noisy speech recognition. In the presented algorithm, the temporal speech feature sequence is first converted to its spectrum via the discrete cosine transform (DCT) or the discrete Fourier transform (DFT), and the DCT- or DFT-based spectrum is then compensated by a thresholding function that shrinks its smaller components. Finally, the updated spectrum is converted back to the temporal domain to obtain the new feature sequence. The method has two advantages: first, the overall compensation process is unsupervised, so no information about the noise in the speech signals is required; second, the threshold can be chosen flexibly under various optimization criteria. The experimental evaluation, performed on the Aurora-2 connected-digit database and task, reveals that the presented methods provide significant improvements in recognition accuracy for speech features pre-processed by any of the statistics-normalization algorithms, including cepstral mean and variance normalization (CMVN), CMVN plus ARMA filtering (MVA), cepstral gain normalization (CGN), and histogram equalization (HEQ). The DFT-based thresholding methods achieve better performance than the DCT-based ones, but we further show that, with the DCT-based methods, compensating only the low-frequency portion performs on a par with compensating the entire frequency band. As a result, both the DCT- and DFT-based compensation methods are quite effective in enhancing the noise robustness of speech features. |
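The transform–threshold–inverse-transform pipeline described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the thesis's exact implementation: the threshold `tau` is left as a free parameter (the thesis derives it from criteria such as the coefficient standard deviation, DCT-TDstd, or the Teager energy operator, DCT-TDTEO), soft thresholding is assumed, and the DFT variant thresholds the magnitude while preserving the phase, as in Fig. 3.5.

```python
import numpy as np
from scipy.fft import dct, idct, fft, ifft


def soft_threshold(x, tau):
    """Soft thresholding: shrink values toward zero; components with
    magnitude below tau are removed entirely (cf. Fig. 3.4)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def dct_threshold_denoise(seq, tau):
    """Hypothetical sketch of DCT-based threshold denoising of one
    temporal feature sequence (one cepstral dimension over frames)."""
    spec = dct(seq, type=2, norm='ortho')    # temporal sequence -> modulation spectrum
    spec = soft_threshold(spec, tau)         # suppress the smaller components
    return idct(spec, type=2, norm='ortho')  # back to the temporal domain


def dft_threshold_denoise(seq, tau):
    """DFT variant: threshold the spectral magnitude, keep the phase."""
    spec = fft(seq)
    mag = soft_threshold(np.abs(spec), tau)  # shrink the magnitude only
    new_spec = mag * np.exp(1j * np.angle(spec))
    return np.real(ifft(new_spec))           # imaginary residue is numerical noise
```

In both variants the thresholding attenuates the weaker modulation-spectral components, which are assumed to be dominated by noise, while the stronger, speech-dominated components survive largely intact.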
Table of Contents |
Acknowledgments
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Motivation
  1.2 Research Direction
  1.3 Thesis Outline
Chapter 2 Speech Feature Extraction and Speech Processing Techniques
  2.1 Speech Feature Extraction
  2.2 Speech Processing Techniques
    2.2.1 Speech Enhancement
    2.2.2 Robust Speech Feature Extraction
Chapter 3 Spectral Threshold Denoising
  3.1 Theoretical Background
  3.2 Threshold Denoising on the DCT-Based Spectrum
  3.3 Threshold Denoising on the DFT-Based Spectrum
  3.4 Preliminary Experimental Results
Chapter 4 Recognition Experiment Setup and Baseline Results
  4.1 Experimental Environment and Configuration
  4.2 Acoustic Model Training
  4.3 Baseline Recognition Results
Chapter 5 Experimental Results and Discussion
  5.1 Baseline Results of Robust Speech Feature Methods
  5.2 Results of Wavelet Threshold Denoising
  5.3 Results of DCT-Based Spectral Threshold Denoising
    5.3.1 Parameter Settings Obtained on the Development Set
    5.3.2 Test-Set Results
  5.4 DFT-Based Spectral Threshold Denoising
    5.4.1 Parameter Settings Obtained on the Development Set
    5.4.2 Results of DFT-TD Applied to the Spectral Magnitude
    5.4.3 Results of DFT-TD Applied to the Real Part of the Spectrum
    5.4.4 Results of DFT-TD Applied to the Imaginary Part of the Spectrum
    5.4.5 Results of DFT-TD Applied to Both the Real and Imaginary Parts
  5.5 Overall Analysis and Discussion
Chapter 6 Conclusion and Future Work
References

List of Figures
Figure 1.1: Schematic of spatial and temporal-sequence processing of speech feature vectors
Figure 2.1: Flowchart of MFCC extraction
Figure 2.2: Flowchart of the TEO-based enhancement of the DCT coefficients of speech
Figure 2.3: The sigmoid function and its input-output plot
Figure 2.4: Schematic of wavelet processing
Figure 3.1: JPEG flowchart
Figure 3.2: Luminance quantization matrix
Figure 3.3: Flowchart of threshold denoising on the DCT-based spectrum
Figure 3.4: Comparison of soft and hard thresholding
Figure 3.5: Flowchart of thresholding the magnitude of a complex value while preserving its phase
Figure 3.6: Flowchart of threshold denoising on the DFT-based spectrum
Figure 3.7: Power spectral density of the first feature dimension of the MFCC baseline
Figure 3.8: Power spectral density of the first feature dimension after CMVN
Figure 3.9: Power spectral density of the first feature dimension after CMVN plus DCT-TDstd
Figure 3.10: Power spectral density of the first feature dimension after CMVN plus DCT-TDTEO
Figure 5.1: Comparison of the overall average recognition rates (%) of four robustness techniques
Figure 5.2: Comparison of the overall average recognition rates (%) of the various robustness methods
Figure 5.3: Recognition rates (%) of DCT-TD-related methods
Figure 5.4: Recognition rates (%) of DFT-TD-related methods applied to the spectral magnitude
Figure 5.5: Recognition rates (%) of DFT-TD-related methods applied to the real part of the spectrum
Figure 5.6: Recognition rates (%) of DFT-TD-related methods applied to the imaginary part of the spectrum
Figure 5.7: Recognition rates (%) of DFT-TD-related methods applied to both the real and imaginary parts

List of Tables
Table 2.1: Proportion (%) of outliers in speech features
Table 4.1: Information on the clean-condition training set and test material of the Aurora-2 database
Table 4.2: Speech parameter settings used in this thesis
Table 4.3: Baseline recognition rates (%)
Table 5.1: Recognition rates (%) of CMVN
Table 5.2: Recognition rates (%) of MVA
Table 5.3: Recognition rates (%) of CGN
Table 5.4: Recognition rates (%) of HEQ
Table 5.5: Recognition rates (%) of WTD applied to CMVN-preprocessed features
Table 5.6: Recognition rates (%) of WTD applied to MVA-preprocessed features
Table 5.7: Recognition rates (%) of WTD applied to CGN-preprocessed features
Table 5.8: Recognition rates (%) of WTD applied to HEQ-preprocessed features
Table 5.9: Development-set recognition rates (%) of DCT-TDstd and DCT-TDTEO under different α and β settings
Table 5.10: Recognition rates (%) of CMVN + DCT-TDstd
Table 5.11: Recognition rates (%) of CMVN + DCT-TDTEO
Table 5.12: Recognition rates (%) of MVA + DCT-TDstd
Table 5.13: Recognition rates (%) of MVA + DCT-TDTEO
Table 5.14: Recognition rates (%) of CGN + DCT-TDstd
Table 5.15: Recognition rates (%) of CGN + DCT-TDTEO
Table 5.16: Recognition rates (%) of HEQ + DCT-TDstd
Table 5.17: Recognition rates (%) of HEQ + DCT-TDTEO
Table 5.18: Development-set recognition rates (%) of DFT-TDstd and DFT-TDTEO under different α and β settings on CMVN-preprocessed features
Table 5.19: Development-set recognition rates (%) of DFT-TDstd and DFT-TDTEO under different α and β settings on MVA-preprocessed features
Table 5.20: Development-set recognition rates (%) of DFT-TDstd and DFT-TDTEO under different α and β settings on CGN-preprocessed features
Table 5.21: Development-set recognition rates (%) of DFT-TDstd and DFT-TDTEO under different α and β settings on HEQ-preprocessed features
Table 5.22: Recognition rates (%) of CMVN with DFT-TDstd applied to the spectral magnitude
Table 5.23: Recognition rates (%) of CMVN with DFT-TDTEO applied to the spectral magnitude
Table 5.24: Recognition rates (%) of MVA with DFT-TDstd applied to the spectral magnitude
Table 5.25: Recognition rates (%) of MVA with DFT-TDTEO applied to the spectral magnitude
Table 5.26: Recognition rates (%) of CGN with DFT-TDstd applied to the spectral magnitude
Table 5.27: Recognition rates (%) of CGN with DFT-TDTEO applied to the spectral magnitude
Table 5.28: Recognition rates (%) of HEQ with DFT-TDstd applied to the spectral magnitude
Table 5.29: Recognition rates (%) of HEQ with DFT-TDTEO applied to the spectral magnitude
Table 5.30: Recognition rates (%) of CMVN with DFT-TDstd applied to the real part of the spectrum
Table 5.31: Recognition rates (%) of CMVN with DFT-TDTEO applied to the real part of the spectrum
Table 5.32: Recognition rates (%) of MVA with DFT-TDstd applied to the real part of the spectrum
Table 5.33: Recognition rates (%) of MVA with DFT-TDTEO applied to the real part of the spectrum
Table 5.34: Recognition rates (%) of CGN with DFT-TDstd applied to the real part of the spectrum
Table 5.35: Recognition rates (%) of CGN with DFT-TDTEO applied to the real part of the spectrum
Table 5.36: Recognition rates (%) of HEQ with DFT-TDstd applied to the real part of the spectrum
Table 5.37: Recognition rates (%) of HEQ with DFT-TDTEO applied to the real part of the spectrum
Table 5.38: Recognition rates (%) of CMVN with DFT-TDstd applied to the imaginary part of the spectrum
Table 5.39: Recognition rates (%) of CMVN with DFT-TDTEO applied to the imaginary part of the spectrum
Table 5.40: Recognition rates (%) of MVA with DFT-TDstd applied to the imaginary part of the spectrum
Table 5.41: Recognition rates (%) of MVA with DFT-TDTEO applied to the imaginary part of the spectrum
Table 5.42: Recognition rates (%) of CGN with DFT-TDstd applied to the imaginary part of the spectrum
Table 5.43: Recognition rates (%) of CGN with DFT-TDTEO applied to the imaginary part of the spectrum
Table 5.44: Recognition rates (%) of HEQ with DFT-TDstd applied to the imaginary part of the spectrum
Table 5.45: Recognition rates (%) of HEQ with DFT-TDTEO applied to the imaginary part of the spectrum
Table 5.46: Recognition rates (%) of CMVN with DFT-TDstd applied to both the real and imaginary parts
Table 5.47: Recognition rates (%) of CMVN with DFT-TDTEO applied to both the real and imaginary parts
Table 5.48: Recognition rates (%) of MVA with DFT-TDstd applied to both the real and imaginary parts
Table 5.49: Recognition rates (%) of MVA with DFT-TDTEO applied to both the real and imaginary parts
Table 5.50: Recognition rates (%) of CGN with DFT-TDstd applied to both the real and imaginary parts
Table 5.51: Recognition rates (%) of CGN with DFT-TDTEO applied to both the real and imaginary parts
Table 5.52: Recognition rates (%) of HEQ with DFT-TDstd applied to both the real and imaginary parts
Table 5.53: Recognition rates (%) of HEQ with DFT-TDTEO applied to both the real and imaginary parts
Table 5.54: Recognition rates (%) of CMVN and its combined methods
Table 5.55: Recognition rates (%) of MVA and its combined methods
Table 5.56: Recognition rates (%) of CGN and its combined methods
Table 5.57: Recognition rates (%) of HEQ and its combined methods
Table 5.58: Recognition rates (%) of DCT-TD applied to the low-frequency spectral components |
Full-Text Usage Rights |