Spoken Hebrew Corpus and Models: Speech-to-Text and Text-to-Speech

March 19, 2023

Joseph Keshet

Application

The limited availability of digital products in Hebrew largely stems from the absence of substantial Hebrew data corpora for machine learning training. Notably, services such as speech-to-text and text-to-speech are rarely incorporated into commercial products and are entirely absent for research purposes. Furthermore, upcoming NLP models aim to forgo the transcription phase and instead focus on textless NLP centered on speech and audio vocalizations.

This gap has heightened the demand for foundational model components and datasets. Building them proactively is meant to ensure that tools for automatic speech transcription, speech generation, and other spoken language modeling can soon be developed for the Hebrew language.

Our Innovation

In collaboration with Yossi Adi's lab at the Hebrew University, my lab is developing the first Modern Hebrew speech data corpus. This corpus will include transcriptions synchronized with the speech at the sentence level, paving the way for the creation of speech recognition, modeling, and synthesis systems. It will encompass 1,200 hours of diverse speech, including read, spontaneous, and clean expressive speech, covering reading, emotional dialogue, and casual conversation. Our researchers are pioneering a unique recording system and software that streamline adjustments during recording, transcription, and automatic data synchronization.

We are also developing a comprehensive, commercial-quality system for Hebrew speech recognition, including pronunciation (utilizing 'niqqud', a set of diacritical marks to denote vowels or alternate letter pronunciations in Hebrew). This advancement is expected to facilitate the automatic transcription of unannotated information in the future.
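
For readers unfamiliar with niqqud, the short Python sketch below illustrates how the marks are represented in digital text: in Unicode, each one is a separate nonspacing combining character that follows its base letter, so pointed and unpointed spellings of the same word can be converted by dropping those characters. This is only an illustration using the standard library; the function name strip_niqqud is a hypothetical helper, and nothing here describes how the project's own systems handle niqqud.

```python
import unicodedata


def strip_niqqud(text: str) -> str:
    """Return text with Hebrew diacritics removed (illustrative helper).

    Niqqud (and cantillation marks) have Unicode category "Mn"
    (nonspacing mark). Note that this also removes nonspacing marks
    from any other script present in the input.
    """
    # NFD splits precomposed presentation forms into base letter + marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# "shalom" written with niqqud: shin, qamats, shin dot, lamed, vav, holam, final mem
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
plain = strip_niqqud(pointed)
print(len(pointed), len(plain))
```

The pointed string holds seven code points and the stripped one holds four, which shows why a recognizer that outputs niqqud carries pronunciation information that plain Hebrew spelling leaves implicit.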

Opportunity

The transcription and speech production systems will be available to academic institutions at no cost and will be licensed to companies in industry. The research team welcomes both academic collaborations and commercial use of the dataset.

The Israel Innovation Authority supports this project and maintains a GitHub repository for NLP and other Hebrew resources.