Study of senone-based deep neural network approaches for spoken language recognition

This paper compares different approaches for using deep neural networks (DNNs) trained to predict senone posteriors for the task of spoken language recognition (SLR). These approaches have recently been found to outperform various baseline systems on different datasets, but they have not yet been co...

Bibliographic Details
Main author: Ferrer, L.
Other authors: Lei, Y., McLaren, M., Scheffer, N.
Format: Book chapter
Language: English
Published: Institute of Electrical and Electronics Engineers Inc. 2016
Online access: Scopus record
DOI
Handle
Digital Library record
Contributed by: Reference record: Request the resource here
LEADER 12828caa a22010457a 4500
001 PAPER-16404
003 AR-BaUEN
005 20230518204725.0
008 190411s2016 xx ||||fo|||| 00| 0 eng|d
024 7 |2 scopus  |a 2-s2.0-84957047938 
040 |a Scopus  |b spa  |c AR-BaUEN  |d AR-BaUEN 
100 1 |a Ferrer, L. 
245 1 0 |a Study of senone-based deep neural network approaches for spoken language recognition 
260 |b Institute of Electrical and Electronics Engineers Inc.  |c 2016 
506 |2 openaire  |e Política editorial 
504 |a Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Kingsbury, B., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups (2012) IEEE Signal Process. Mag., 29 (6), pp. 82-97. , Nov 
504 |a Dahl, G.E., Yu, D., Deng, L., Acero, A., Context-dependent pretrained deep neural networks for large-vocabulary speech recognition (2012) IEEE Trans. Audio, Speech, Lang. Process., 20 (1), pp. 30-42. , Jan 
504 |a Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P., Front-end factor analysis for speaker verification (2011) IEEE Trans. Audio, Speech, Lang. Process., 19 (4), pp. 788-798. , May 
504 |a Martínez-González, D., Plchot, O., Burget, L., Glembek, O., Matejka, P., Language recognition in ivectors space (2011) Proc. Interspeech, Florence, Italy, , Aug 
504 |a Dehak, N., Torres-Carrasquillo, P.A., Reynolds, D.A., Dehak, R., Language recognition via i-vectors and dimensionality reduction (2011) Proc. Interspeech, Florence, Italy, , Aug 
504 |a Lei, Y., Scheffer, N., Ferrer, L., McLaren, M., A novel scheme for speaker recognition using a phonetically-aware deep neural network (2014) Proc. ICASSP, Florence, Italy, , May 
504 |a Lei, Y., Ferrer, L., Lawson, A., McLaren, M., Scheffer, N., Application of convolutional neural networks to language identification in noisy conditions (2014) Proc. Odyssey'14, Joensuu, Finland, , Jun 
504 |a Kenny, P., Gupta, V., Stafylakis, T., Ouellet, P., Alam, J., Deep neural networks for extracting Baum-Welch statistics for speaker recognition (2014) Proc. Odyssey'14, Joensuu, Finland, , Jun 
504 |a Ferrer, L., Lei, Y., McLaren, M., Scheffer, N., Spoken language recognition based on senone posteriors (2014) Proc. Interspeech, Singapore, , Sep 
504 |a Song, Y., Jiang, B., Bao, Y., Wei, S., Dai, L.-R., I-vector representation based on bottleneck features for language identification (2013) Electron. Lett., 49 (24), pp. 1569-1570 
504 |a Jiang, B., Song, Y., Wei, S., Liu, J.-H., McLoughlin, I.V., Dai, L.-R., Deep bottleneck features for spoken language identification (2014) PLOS One, pp. 1-11. , Jul 
504 |a Matejka, P., Zhang, L., Ng, T., Mallidi, S.H., Glembek, O., Ma, J., Zhang, B., Neural network bottleneck features for language identification (2014) Proc. Odyssey'14, Joensuu, Finland, , Jun 
504 |a Lopez-Moreno, I., Gonzalez-Dominguez, J., Plchot, O., Martínez-González, D., Gonzalez-Rodriguez, J., Moreno, P.J., Automatic language identification using deep neural networks (2014) Proc. ICASSP, Florence, Italy, pp. 5337-5341. , May 
504 |a Diez, M., Varona, A., Penagarikano, M., Rodriguez-Fuentes, L.J., Bordel, G., On the use of phone log-likelihood ratios as features in spoken language recognition (2012) Proc. IEEE Workshop Spoken Lang. Technol. (SLT'12), pp. 274-279. , Miami, FL, USA 
504 |a Matejka, P., Schwarz, P., Cernocky, J., Chytil, P., Phonotactic language identification using high quality phoneme recognition (2005) Proc. Interspeech'05 
504 |a Shen, W., Campbell, W., Gleason, T., Reynolds, D., Singer, E., Experiments with lattice-based PPRLM language identification (2006) Proc. Odyssey'06, San Juan, Puerto Rico, , Jun 
504 |a Stolcke, A., Akbacak, M., Ferrer, L., Kajarekar, S., Richey, C., Scheffer, N., Shriberg, E., Improving language recognition with multilingual phone recognition and speaker adaptation transforms (2010) Proc. Odyssey'10, Brno, Czech Republic, , Jun 
504 |a D'Haro, L.F., Glembek, O., Plchot, O., Matejka, P., Soufifar, M., Cordoba, R., Cernocky, J., Phonotactic language recognition using i-vectors and phoneme posteriogram counts (2012) Proc. Interspeech, Portland, OR, USA, , Sep 
504 |a Young, S.J., Odell, J.J., Woodland, P.C., Tree-based state tying for high accuracy acoustic modelling (1994) Proc. Workshop Human Lang. Technol. (HLT'94) 
504 |a Deng, L., Yu, D., Deep convex network: A scalable architecture for speech pattern classification (2011) Proc. Interspeech, Florence, Italy, , Aug 
504 |a Huang, P., Deng, L., Hasegawa-Johnson, M., He, X., Random features for kernel deep convex network (2013) Proc. ICASSP, , Vancouver, BC, USA, May 
504 |a Mohamed, A., Graves, A., Jaitly, N., Hybrid speech recognition with deep bidirectional LSTM (2013) Proc. IEEE Workshop Speech Recognit. Understand., Olomouc, Czech Republic, , Dec 
504 |a Le Cun, Y., Bengio, Y., (1995) Convolutional Networks for Images, Speech, and Time-Series, pp. 255-258. , Cambridge, MA, USA: MIT Press 
504 |a Scheffer, N., Lei, Y., Ferrer, L., Factor analysis back ends for MLLR transforms in speaker recognition (2011) Proc. Interspeech, Florence, Italy, , Aug 
504 |a Kenny, P., Ouellet, P., Dehak, N., Gupta, V., Dumouchel, P., A study of inter-speaker variability in speaker verification (2008) IEEE Trans. Audio, Speech, Lang. Process., 16 (4), pp. 980-988. , Jul 
504 |a Matejka, P., Plchot, O., Soufifar, M., Glembek, O., D'haro Enríquez, L.F., Veselý, K., Ma, J., Dehak, N., Patrol team language identification system for DARPA RATS P1 evaluation (2012) Proc. Interspeech, Portland, OR, USA, , Sep 
504 |a Lawson, A., McLaren, M., Lei, Y., Mitra, V., Scheffer, N., Ferrer, L., Graciarena, M., Improving language identification robustness to highly channel-degraded speech through multiple system fusion (2013) Proc. Interspeech, Lyon, France, , Aug 
504 |a McLaren, M., Lawson, A., Lei, Y., Scheffer, N., Adaptive Gaussian backend for robust language identification (2013) Proc. Interspeech, Lyon, France, , Aug 
504 |a Penagarikano, M., Varona, A., Diez, M., Rodriguez-Fuentes, L.J., Bordel, G., Study of different backends in a state-of-the-art language recognition system (2012) Proc. Interspeech, Portland, OR, USA, , Sep 
504 |a Brummer, N., Van Leeuwen, D.A., On calibration of language recognition scores (2006) Proc. Odyssey'06, San Juan, Puerto Rico, , Jun 
504 |a Van Leeuwen, D.A., Brummer, N., Channel-dependent GMM and multi-class logistic regression models for language recognition (2006) Proc. Odyssey'06, San Juan, Puerto Rico, , Jun 
504 |a NIST LRE09 Evaluation Plan, , http://www.itl.nist.gov/iad/mig/tests/lre/2009/LRE09_EvalPlan_v6.pdf, [Online]. Available 
504 |a Bielefeld, B., Language identification using shifted delta cepstrum (1994) Proc. 14th Annu. Speech Res. Symp. 
504 |a Ferrer, L., Bratt, H., Burget, L., Cernocky, H., Glembek, O., Graciarena, M., Lawson, A., Scheffer, N., Promoting robustness for speaker modeling in the community: The PRISM evaluation set (2011) Procs. SRE11 Anal. Workshop, Atlanta, GA, USA, , Dec 
504 |a Jancik, Z., Plchot, O., Brümmer, N., Burget, L., Glembek, O., Hubeika, V., Karafiát, M., Strasheim, A., Data selection and calibration issues in automatic language recognition-investigation with BUT-AGNITIO NIST LRE 2009 system (2010) Proc. Odyssey'10, Brno, Czech Republic, , Jun 
504 |a D'haro Enríquez, L.F., Glembek, O., Plchot, O., Matejka, P., Soufifar, M., Córdoba Herralde, R., Cernocky, J., Phonotactic language recognition using i-vectors and phoneme posteriogram counts (2012) Proc. Interspeech, Portland, OR, USA, , Sep 
504 |a Walker, K., Strassel, S., The RATS radio traffic collection system (2012) Proc. Odyssey'12: Speaker Lang. Recognit. Workshop 
504 |a DARPA RATS Program, , http://www.darpa.mil/Our_Work/I2O/Programs/Robust_Automatic_Transcription_of_Speech_(RATS).aspx, [Online]. Available 
504 |a Ma, J.Z., Zhang, B., Matsoukas, S., Mallidi, S.H.R., Li, F., Hermansky, H., Improvements in language identification on the rats noisy speech corpus (2013) Proc. Interspeech, Lyon, France, , Aug 
504 |a Kim, C., Stern, R.M., Power-normalized cepstral coefficients (PNCC) for robust speech recognition (2012) Proc. ICASSP, Kyoto, Japan, pp. 4101-4104. , Mar 
504 |a McLaren, M., Lei, Y., Improved speaker recognition using DCT coefficients as features (2015) Proc. ICASSP, Brisbane, Australia, pp. 4430-4434. , May 
504 |a McLaren, M., Scheffer, N., Graciarena, M., Ferrer, L., Lei, Y., Improving speaker identification robustness to highly channel-degraded speech through multiple system fusion (2013) Proc. ICASSP, Vancouver, BC, Canada, , May 
504 |a McLaren, M., Graciarena, M., Lei, Y., Softsad: Integrated frame-based speech confidence for speaker recognition (2015) Proc. ICASSP, Brisbane, Australia, pp. 4694-4698. , May 
520 3 |a This paper compares different approaches for using deep neural networks (DNNs) trained to predict senone posteriors for the task of spoken language recognition (SLR). These approaches have recently been found to outperform various baseline systems on different datasets, but they have not yet been compared to each other or to a common baseline. Two of these approaches use the DNNs to generate feature vectors which are then processed in different ways to predict the score of each language given a test sample. The features are extracted either from a bottleneck layer in the DNN or from the output layer. In the third approach, the standard i-vector extraction procedure is modified to use the senones as classes and the DNN to predict the zeroth order statistics. We compare these three approaches and conclude that the approach based on bottleneck features followed by i-vector modeling outperforms the other two approaches. We also show that score-level fusion of some of these approaches leads to gains over using a single approach for short-duration test samples. Finally, we demonstrate that fusing systems that use DNNs trained with several languages leads to improvements in performance over the best single system, and we propose an adaptation procedure for DNNs trained with languages with less available data. Overall, we show improvements between 40% and 70% relative to a state-of-the-art Gaussian mixture model (GMM) i-vector system on test durations from 3 seconds to 120 seconds on two significantly different tasks: the NIST 2009 language recognition evaluation task and the DARPA RATS language identification task. © 2015 IEEE.  |l eng 
593 |a Computer Science Department, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, C1428EGA Autonomous City of Buenos Aires, Buenos Aires, Argentina 
593 |a CONICET, C1425FQB Autonomous City of Buenos Aires, Buenos Aires, Argentina 
593 |a Speech Technology and Research Laboratory, SRI International, Menlo Park, CA 94025, United States 
593 |a Facebook, Inc., Menlo Park, CA 94025, United States 
690 1 0 |a DEEP NEURAL NETWORKS (DNNS) 
690 1 0 |a SENONES 
690 1 0 |a SPOKEN LANGUAGE RECOGNITION (SLR) 
690 1 0 |a FORECASTING 
690 1 0 |a GAUSSIAN DISTRIBUTION 
690 1 0 |a SPEECH RECOGNITION 
690 1 0 |a VECTORS 
690 1 0 |a BOTTLENECK FEATURES 
690 1 0 |a DEEP NEURAL NETWORKS 
690 1 0 |a GAUSSIAN MIXTURE MODEL 
690 1 0 |a LANGUAGE IDENTIFICATION 
690 1 0 |a LANGUAGE RECOGNITION 
690 1 0 |a SCORE-LEVEL FUSION 
690 1 0 |a SENONES 
690 1 0 |a SPOKEN LANGUAGE RECOGNITION 
690 1 0 |a COMPUTATIONAL LINGUISTICS 
700 1 |a Lei, Y. 
700 1 |a McLaren, M. 
700 1 |a Scheffer, N. 
773 0 |d Institute of Electrical and Electronics Engineers Inc., 2016  |g v. 24  |h pp. 105-116  |k n. 1  |p IEEE ACM Trans. Audio Speech Lang. Process.  |x 23299290  |t IEEE/ACM Transactions on Audio Speech and Language Processing 
856 4 1 |u https://www.scopus.com/inward/record.uri?eid=2-s2.0-84957047938&doi=10.1109%2fTASLP.2015.2496226&partnerID=40&md5=d37470e13e31650aa1ae4ad2c761db31  |y Registro en Scopus 
856 4 0 |u https://doi.org/10.1109/TASLP.2015.2496226  |y DOI 
856 4 0 |u https://hdl.handle.net/20.500.12110/paper_23299290_v24_n1_p105_Ferrer  |y Handle 
856 4 0 |u https://bibliotecadigital.exactas.uba.ar/collection/paper/document/paper_23299290_v24_n1_p105_Ferrer  |y Registro en la Biblioteca Digital 
961 |a paper_23299290_v24_n1_p105_Ferrer  |b paper  |c PE 
962 |a info:eu-repo/semantics/article  |a info:ar-repo/semantics/artículo  |b info:eu-repo/semantics/publishedVersion 
999 |c 77357