§ Browse Thesis Bibliographic Record
System ID	U0020-2507201421495400
Title (Chinese)	門檻值去噪法於調變頻譜之強健性語音辨識研究
Title (English)	A Study of Threshold Denoising on Modulation Spectrum for Robust Speech Recognition
University	National Chi Nan University
Department (Chinese)	電機工程學系
Department (English)	Electrical Engineering
Academic Year	102
Semester	2
Year of Publication	103
Author (Chinese)	程彥誌
Author (English)	Yen-Chih Cheng
Student ID	101323558
Degree	Master's
Language	Traditional Chinese
Date of Oral Defense	2014-06-25
Number of Pages	75
Committee	Member - 莊家峰
Advisor - 吳俊德
Co-advisor - 洪志偉
Keywords (Chinese)	強健性語音辨識
門檻值去噪
離散餘弦轉換
離散傅立葉轉換
調變頻譜
Keywords (English)	robust speech recognition
threshold denoising
discrete cosine transform
discrete Fourier transform
modulation spectrum
Subject Classification	Engineering
Chinese Abstract (translated)
This thesis proposes a new noise-robustness technique to improve the recognition accuracy of speech features in noisy environments. In the proposed algorithm, the temporal-domain feature sequence is first transformed to its corresponding frequency domain via the DCT or the DFT; a threshold is then applied to remove the smaller components; finally, the features are transformed from the modulation spectrum back to the temporal domain to obtain the new features. The method has two advantages: first, the entire compensation process is unsupervised and requires no additional information about the noise; second, the threshold setting is highly flexible, with more than one choice available. Evaluations on the internationally used Aurora-2 connected-digit corpus show that the proposed method brings significant accuracy improvements to features pre-processed by any of the statistics-normalization methods, such as CMVN, MVA, CGN, and HEQ. The DFT-based results are generally better than the DCT-based ones, but we further found that, with the DCT-based method, compensating only the low-frequency portion achieves performance similar to, or even better than, compensating the full band. Both the DCT- and DFT-based methods are therefore well worth exploiting.
English Abstract
This thesis presents a novel noise-robustness algorithm that enhances speech features for noisy speech recognition. In the presented algorithm, the temporal speech feature sequence is first converted to its spectrum via the discrete cosine transform (DCT) or the discrete Fourier transform (DFT), and the DCT- or DFT-based spectrum is then compensated by a thresholding function that shrinks its smaller components. Finally, the updated spectrum is converted back to the temporal domain to obtain the new feature sequence. The method has two advantages: first, the overall compensation process is unsupervised, requiring no information about the noise in the speech signals; second, the threshold can be set flexibly under various optimization criteria. The experimental evaluation, performed on the Aurora-2 connected-digit database and task, reveals that the presented methods provide significant improvements in recognition accuracy for speech features pre-processed by any of the statistics-normalization algorithms, including cepstral mean and variance normalization (CMVN), CMVN plus ARMA filtering (MVA), cepstral gain normalization (CGN), and histogram equalization (HEQ). The DFT-based thresholding methods achieve better performance than the DCT-based ones, but we further show that, with the DCT-based methods, compensating only the low-frequency portion yields performance on a par with that achieved by compensation over the entire frequency band. As a result, both the DCT- and DFT-based compensation methods are quite effective in enhancing the noise robustness of speech features.
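The transform / threshold / inverse-transform pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration, not the thesis's exact method: the α-scaled median-magnitude threshold used below is an assumed stand-in for the thresholds (DCT-TDstd, DCT-TDTEO) that the thesis tunes on a development set, and the function name is hypothetical.

```python
import numpy as np
from scipy.fft import dct, idct

def threshold_denoise(seq, alpha=1.0, transform='dct'):
    """Hard-threshold the modulation spectrum of one temporal feature
    sequence and return the reconstructed sequence.

    `alpha` scales a median-magnitude threshold -- an illustrative
    assumption, not the thesis's development-set-tuned criterion.
    """
    seq = np.asarray(seq, dtype=float)
    if transform == 'dct':
        spec = dct(seq, norm='ortho')                 # real-valued modulation spectrum
        mag = np.abs(spec)
        spec = np.where(mag >= alpha * np.median(mag), spec, 0.0)
        return idct(spec, norm='ortho')               # back to the temporal domain
    elif transform == 'dft':
        spec = np.fft.rfft(seq)                       # complex modulation spectrum
        mag = np.abs(spec)
        keep = mag >= alpha * np.median(mag)          # zero small bins (magnitude and phase together)
        return np.fft.irfft(np.where(keep, spec, 0.0), n=len(seq))
    raise ValueError("transform must be 'dct' or 'dft'")
```

In practice such a function would be applied independently to each cepstral-coefficient sequence of an utterance after a normalization step such as CMVN; the thesis additionally studies DFT variants that threshold only the magnitude, the real part, or the imaginary part of the spectrum.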
Table of Contents
Acknowledgements
Abstract (Chinese)
Abstract (English)
Contents
List of Figures
List of Tables
Chapter 1	Introduction
1.1 Research Motivation
1.2 Research Direction
1.3 Thesis Organization
Chapter 2	Speech Feature Extraction and an Overview of Speech Processing Techniques
2.1 Speech Feature Extraction
2.2 Overview of Speech Processing Techniques
2.2.1 Speech Enhancement
2.2.2 Robust Speech Feature Extraction
Chapter 3	Spectral Threshold Denoising
3.1 Theoretical Background
3.2 Threshold Denoising on the DCT-Based Spectrum
3.3 Threshold Denoising on the DFT-Based Spectrum
3.4 Preliminary Experimental Results
Chapter 4	Recognition Experiment Setup and Baseline Results
4.1 Experimental Environment and Framework
4.2 Construction of the Acoustic Models
4.3 Baseline Recognition Results
Chapter 5	Experimental Results and Discussion
5.1 Baseline Results of the Robust Speech Feature Methods
5.2 Results of Wavelet Threshold Denoising
5.3 Results of Threshold Denoising on the DCT-Based Spectrum
5.3.1 Parameter Settings Obtained on the Development Set
5.3.2 Results on the Test Sets
5.4 Threshold Denoising on the DFT-Based Spectrum
5.4.1 Parameter Settings Obtained on the Development Set
5.4.2 Results of DFT-TD Applied to the Spectral Magnitude
5.4.3 Results of DFT-TD Applied to the Real Part of the Spectrum
5.4.4 Results of DFT-TD Applied to the Imaginary Part of the Spectrum
5.4.5 Results of DFT-TD Applied to Both the Real and Imaginary Parts
5.5 Overall Analysis and Discussion
Chapter 6	Conclusions and Future Work
References

List of Figures
Figure 1.1: Illustration of spatial and temporal-sequence processing of speech feature vectors
Figure 2.1: Flowchart of Mel-frequency cepstral coefficient extraction
Figure 2.2: Flowchart of the TEO-based enhancement of the DCT coefficients of speech
Figure 2.3: The sigmoid function and its input-output curve
Figure 2.4: Illustration of wavelet processing
Figure 3.1: Flowchart of JPEG
Figure 3.2: Luminance quantization matrix
Figure 3.3: Flowchart of threshold denoising on the DCT-based spectrum
Figure 3.4: Comparison of soft and hard thresholding
Figure 3.5: Flowchart of thresholding the magnitude of a complex value while preserving its phase
Figure 3.6: Flowchart of threshold denoising on the DFT-based spectrum
Figure 3.7: Power spectral density of the first feature dimension of the MFCC baseline
Figure 3.8: Power spectral density of the first feature dimension after CMVN
Figure 3.9: Power spectral density of the first feature dimension after CMVN plus DCT-TDstd
Figure 3.10: Power spectral density of the first feature dimension after CMVN plus DCT-TDTEO
Figure 5.1: Comparison of the overall average recognition rates (%) of four robustness techniques
Figure 5.2: Comparison of the overall average recognition rates (%) of the various robustness methods
Figure 5.3: Recognition rates (%) of the DCT-TD-related methods
Figure 5.4: Recognition rates (%) of the DFT-TD-related methods applied to the spectral magnitude
Figure 5.5: Recognition rates (%) of the DFT-TD-related methods applied to the real part of the spectrum
Figure 5.6: Recognition rates (%) of the DFT-TD-related methods applied to the imaginary part of the spectrum
Figure 5.7: Recognition rates (%) of the DFT-TD-related methods applied to both the real and imaginary parts of the spectrum


List of Tables
Table 2.1: Proportion (%) of outliers in the distribution of speech features
Table 4.1: Information on the clean-condition training set and the test sets of the Aurora-2 database
Table 4.2: Speech parameter settings used in this thesis
Table 4.3: Recognition rates (%) of the baseline experiments
Table 5.1: Recognition rates (%) of CMVN
Table 5.2: Recognition rates (%) of MVA
Table 5.3: Recognition rates (%) of CGN
Table 5.4: Recognition rates (%) of HEQ
Table 5.5: Recognition rates (%) of WTD applied to CMVN-preprocessed features
Table 5.6: Recognition rates (%) of WTD applied to MVA-preprocessed features
Table 5.7: Recognition rates (%) of WTD applied to CGN-preprocessed features
Table 5.8: Recognition rates (%) of WTD applied to HEQ-preprocessed features
Table 5.9: Recognition rates (%) of DCT-TDstd and DCT-TDTEO on the development set under different α and β settings
Table 5.10: Recognition rates (%) of CMVN + DCT-TDstd
Table 5.11: Recognition rates (%) of CMVN + DCT-TDTEO
Table 5.12: Recognition rates (%) of MVA + DCT-TDstd
Table 5.13: Recognition rates (%) of MVA + DCT-TDTEO
Table 5.14: Recognition rates (%) of CGN + DCT-TDstd
Table 5.15: Recognition rates (%) of CGN + DCT-TDTEO
Table 5.16: Recognition rates (%) of HEQ + DCT-TDstd
Table 5.17: Recognition rates (%) of HEQ + DCT-TDTEO
Table 5.18: Recognition rates (%) of DFT-TDstd and DFT-TDTEO on development-set CMVN-preprocessed features under different α and β settings
Table 5.19: Recognition rates (%) of DFT-TDstd and DFT-TDTEO on development-set MVA-preprocessed features under different α and β settings
Table 5.20: Recognition rates (%) of DFT-TDstd and DFT-TDTEO on development-set CGN-preprocessed features under different α and β settings
Table 5.21: Recognition rates (%) of DFT-TDstd and DFT-TDTEO on development-set HEQ-preprocessed features under different α and β settings
Table 5.22: Recognition rates (%) of CMVN with DFT-TDstd applied to the spectral magnitude
Table 5.23: Recognition rates (%) of CMVN with DFT-TDTEO applied to the spectral magnitude
Table 5.24: Recognition rates (%) of MVA with DFT-TDstd applied to the spectral magnitude
Table 5.25: Recognition rates (%) of MVA with DFT-TDTEO applied to the spectral magnitude
Table 5.26: Recognition rates (%) of CGN with DFT-TDstd applied to the spectral magnitude
Table 5.27: Recognition rates (%) of CGN with DFT-TDTEO applied to the spectral magnitude
Table 5.28: Recognition rates (%) of HEQ with DFT-TDstd applied to the spectral magnitude
Table 5.29: Recognition rates (%) of HEQ with DFT-TDTEO applied to the spectral magnitude
Table 5.30: Recognition rates (%) of CMVN with DFT-TDstd applied to the real part of the spectrum
Table 5.31: Recognition rates (%) of CMVN with DFT-TDTEO applied to the real part of the spectrum
Table 5.32: Recognition rates (%) of MVA with DFT-TDstd applied to the real part of the spectrum
Table 5.33: Recognition rates (%) of MVA with DFT-TDTEO applied to the real part of the spectrum
Table 5.34: Recognition rates (%) of CGN with DFT-TDstd applied to the real part of the spectrum
Table 5.35: Recognition rates (%) of CGN with DFT-TDTEO applied to the real part of the spectrum
Table 5.36: Recognition rates (%) of HEQ with DFT-TDstd applied to the real part of the spectrum
Table 5.37: Recognition rates (%) of HEQ with DFT-TDTEO applied to the real part of the spectrum
Table 5.38: Recognition rates (%) of CMVN with DFT-TDstd applied to the imaginary part of the spectrum
Table 5.39: Recognition rates (%) of CMVN with DFT-TDTEO applied to the imaginary part of the spectrum
Table 5.40: Recognition rates (%) of MVA with DFT-TDstd applied to the imaginary part of the spectrum
Table 5.41: Recognition rates (%) of MVA with DFT-TDTEO applied to the imaginary part of the spectrum
Table 5.42: Recognition rates (%) of CGN with DFT-TDstd applied to the imaginary part of the spectrum
Table 5.43: Recognition rates (%) of CGN with DFT-TDTEO applied to the imaginary part of the spectrum
Table 5.44: Recognition rates (%) of HEQ with DFT-TDstd applied to the imaginary part of the spectrum
Table 5.45: Recognition rates (%) of HEQ with DFT-TDTEO applied to the imaginary part of the spectrum
Table 5.46: Recognition rates (%) of CMVN with DFT-TDstd applied to both the real and imaginary parts of the spectrum
Table 5.47: Recognition rates (%) of CMVN with DFT-TDTEO applied to both the real and imaginary parts of the spectrum
Table 5.48: Recognition rates (%) of MVA with DFT-TDstd applied to both the real and imaginary parts of the spectrum
Table 5.49: Recognition rates (%) of MVA with DFT-TDTEO applied to both the real and imaginary parts of the spectrum
Table 5.50: Recognition rates (%) of CGN with DFT-TDstd applied to both the real and imaginary parts of the spectrum
Table 5.51: Recognition rates (%) of CGN with DFT-TDTEO applied to both the real and imaginary parts of the spectrum
Table 5.52: Recognition rates (%) of HEQ with DFT-TDstd applied to both the real and imaginary parts of the spectrum
Table 5.53: Recognition rates (%) of HEQ with DFT-TDTEO applied to both the real and imaginary parts of the spectrum
Table 5.54: Recognition rates (%) of CMVN and its combined methods
Table 5.55: Recognition rates (%) of MVA and its combined methods
Table 5.56: Recognition rates (%) of CGN and its combined methods
Table 5.57: Recognition rates (%) of HEQ and its combined methods
Table 5.58: Recognition rates (%) of DCT-TD applied to the low-frequency spectral components
References
[1] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction”, IEEE
Transactions on Acoustics, Speech and Signal Processing, 27(2), pp. 113–120, 1979.
[2] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by
acoustic noise”, IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp. 208-211, 1979.
[3] C. Plapous, C. Marro, and P. Scalart, “Improved signal-to-noise ratio estimation for
speech enhancement”, IEEE Transactions on Audio, Speech and Language Processing,
14(6), pp. 2098–2108, 2006.
[4] J. C. Goswami and A. K. Chan, “Fundamentals of wavelets: Theory, algorithms, and
applications”, Second Edition, 2010.
[5] Y. Wang and M. Brookes, “Speech enhancement using a robust Kalman filter
post-processor in the modulation domain”, IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2013.
[6] T. F. Sanam and H. Imtiaz, “A DCT-based noisy speech enhancement method using
Teager energy operator”, the 5th International Conference on Knowledge and Smart
Technology, 2013.
[7] S. Furui, “Cepstral analysis technique for automatic speaker verification”, IEEE
Transactions on Acoustics, Speech and Signal Processing, vol. 29, no. 2, Apr. 1981.
[8] S. Tibrewala and H. Hermansky, “Multiband and adaptation approaches to robust
speech recognition”, the 1997 Eurospeech Conference on Speech Communications and
Technology, 1997.
[9] C. Chen and J. Bilmes, “MVA processing of speech features”, IEEE Transactions on
Audio, Speech and Language Processing, 15, pp. 257-270, 2007.
[10] S. Yoshizawa et al., “Cepstral gain normalization for noise robust speech recognition”,
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2004.
[11] F. Hilger and H. Ney, “Quantile based histogram equalization for noise robust large
vocabulary speech recognition”, IEEE Transactions on Audio, Speech and Language
Processing, 14, pp. 845-854, 2006.
[12] H. T. Fang et al., “Robustifying cepstral features by mitigating the outlier effect for
noisy speech recognition”, International Conference on Fuzzy Systems and
Knowledge Discovery, 2013.
[13] R. A. Gopinath, “Maximum likelihood modeling with Gaussian distributions for
classification”, IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 1998.
[14] C. Rathinavelu and L. Deng, “HMM-based speech recognition using state-dependent,
discriminatively derived transforms on Mel-warped DFT features”, IEEE
Transactions on Speech and Audio Processing, pp. 243-256, May 1997.
[15] L. Deng et al., “Distributed speech processing in MiPad’s multimodal user interface”,
IEEE Transactions on Speech and Audio Processing, pp. 605-619, November 2002.
[16] C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for
speaker adaptation of continuous density HMMs”, Computer Speech and
Language, pp. 171-186, 1995.
[17] M. J. F. Gales and S. J. Young, “Cepstral parameter compensation for HMM
recognition in noise”, Speech Communication, 12:231-239, 1993.
[18] A. Acero, L. Deng, T. Kristjansson, and J. Zhang, “HMM adaptation using vector
Taylor series for noisy speech recognition”, the 6th International Conference of
Spoken Language Processing, 869-872, 2000.
[19] J. F. Kaiser, “On a simple algorithm to calculate the ‘energy’ of a signal”, IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.
381-384, 1990.
[20] H. J. Hsieh, J. W. Hung and B. Chen, “Exploring joint equalization of
spatial-temporal contextual statistics of speech features for robust speech recognition”,
the 13th Annual Conference of the International Speech Communication Association,
2012.
[21] S. G. Chang, B. Yu and M. Vetterli, “Adaptive wavelet thresholding for image
denoising and compression”, IEEE Transactions on Image Processing, 9, pp.
1532-1546, 2000.
[22] I. C. Lu, “Exploiting wavelet de-noising in the temporal sequences of features for
robust speech recognition”, master thesis, National Chi Nan University, 2011.
[23] G. Zhou, J. H. L. Hansen, and J. F. Kaiser, “Nonlinear feature based classification of
speech under stress”, IEEE Transactions on Speech and Audio Processing, vol. 9, pp.
201-216, 2001.
[24] H. Teager and S. Teager, “Evidence for nonlinear production mechanisms in the vocal
tract”, Speech Production and Speech Modeling, NATO Advanced Study Institute, vol.
55, pp. 241-261, 1990.
[25] H. G. Hirsch and D. Pearce, “The AURORA experimental framework for the
performance evaluation of speech recognition systems under noisy conditions”, the
2000 Automatic Speech Recognition workshop “Challenges for the New Millennium”, 2000.
[26] Y. C. Cheng, J. S. Lin and J. W. Hung, “Leveraging threshold denoising on
DCT-based modulation spectrum for noise robust speech recognition”, the 11th IEEE
International Conference on Control & Automation (ICCA), 2014.
Full-Text Authorization
On campus
The author agrees to make the full electronic thesis publicly available on campus.
The on-campus bibliographic record is released immediately.
Off campus
The author agrees to authorize the database vendor to provide browsing/printing services for the full electronic text, with royalties payable to the author upon notification.
The off-campus electronic thesis is released immediately.
