Improving Speech Recognition Rate through Analysis Parameters
DOI: https://doi.org/10.2478/ecce-2014-0009

Keywords: Computers and information processing, Speech analysis, Speech recognition, Speech enhancement

Abstract
The speech signal is redundant and non-stationary by nature. Because of the inertness of the vocal tract, however, these variations are not very rapid, and the signal can be considered stationary over short segments. It is presumed that the most distinctive information of speech is contained in the short-time magnitude spectrum. This is the main reason the speech signal is analyzed in a frame-by-frame manner: the signal is segmented into overlapping segments (so-called frames), typically 15-25 ms long with an overlap of 10-15 ms.
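As an illustration of this segmentation, below is a minimal sketch in Python, assuming NumPy is available. The function name frame_signal and the default 25 ms frame with 15 ms overlap are illustrative choices drawn from the ranges quoted above, not parameters prescribed by the paper.

```python
# Minimal sketch of overlapping frame segmentation (assumes NumPy).
# Frame and overlap durations are illustrative values from the abstract.
import numpy as np

def frame_signal(signal, fs, frame_ms=25.0, overlap_ms=15.0):
    """Split a 1-D speech signal into overlapping, Hamming-windowed frames."""
    frame_len = int(round(fs * frame_ms / 1000.0))            # samples per frame
    step = frame_len - int(round(fs * overlap_ms / 1000.0))   # hop size (10 ms here)
    n_frames = 1 + max(0, (len(signal) - frame_len) // step)
    frames = np.stack([signal[i * step : i * step + frame_len]
                       for i in range(n_frames)])
    # Window each segment so it can be treated as quasi-stationary
    # before taking the short-time magnitude spectrum.
    return frames * np.hamming(frame_len)

# Example: 1 s of noise at 16 kHz -> 400-sample frames every 160 samples
frames = frame_signal(np.random.randn(16000), fs=16000)
print(frames.shape)  # (98, 400)
```

Each row of the returned array is one frame, ready for short-time magnitude-spectrum analysis.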
License
Copyright (c) 2014 Deividas Eringis, Gintautas Tamulevičius (Author)
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.