Remus-Dan Ungureanu and Mihai Dascalu
pp. 141–152, https://doi.org/10.55612/s-5002-062-009
Abstract
Romanian is the seventh most popular European language, with around 30 million speakers worldwide. Despite its popularity, the available speech resources are limited. As a result, few models transcribe Romanian well, and most of those are multilingual models that also cover less popular languages. Echo is a crowd-sourcing platform that has collected more than 300 hours of speech from various contributors. In this study, we document how a large speech dataset enables researchers to train automatic speech recognition, speaker verification, and diarization models to automatically process students’ notes. We publicly release both the dataset and the Whisper-based baseline model as open source.
Keywords: speech dataset, Romanian language, crowd-sourcing.
CRediT Authors Statement. Remus-Dan Ungureanu: Conceptualization, Investigation, Methodology, Formal analysis, Software, Resources, Data curation, Writing – original draft preparation. Mihai Dascalu: Conceptualization, Methodology, Validation, Writing – review and editing, Supervision, Project administration, Funding acquisition.