
Inside WAXAL, Google's African speech dataset set to reshape multilingual voice AI

Reviewed: Andrii Daniv
Mar 7, 2026 (3 min read)

Google Research announced WAXAL, a large open speech dataset for African languages, on March 6, 2026. The resource covers 27 Sub-Saharan African languages and supports automatic speech recognition and text-to-speech development. The project began in 2021 and was developed with African academic and community partners. Details are outlined in the Google Research announcement on WAXAL, and an overview film of the project is available on YouTube.

WAXAL dataset for African language speech technology

WAXAL is a large-scale, openly accessible speech corpus released by Google Research for African language technologies. According to the announcement, the dataset currently includes languages spoken by more than 100 million people in more than 26 countries. All data is released under the Creative Commons Attribution 4.0 license (CC-BY-4.0).

The initial release contains approximately 1,846 hours of transcribed natural speech for automatic speech recognition (ASR) tasks and more than 565 hours of studio-quality recordings for text-to-speech (TTS) systems. The dataset is hosted on Hugging Face as the google/WaxalNLP collection.
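For readers who want to inspect the data, the following is a minimal sketch of how the collection might be streamed with the Hugging Face `datasets` library. The repository name comes from the announcement; the configuration name and the `train` split are assumptions for illustration only, so check the dataset card for the actual values before running it.

```python
from typing import Any, Dict

# Repository name as reported in the article.
REPO_ID = "google/WaxalNLP"


def waxal_load_kwargs(config_name: str, split: str = "train") -> Dict[str, Any]:
    """Build keyword arguments for datasets.load_dataset.

    config_name is hypothetical here: the article does not list the
    per-language or per-task configuration names. The default split
    name "train" is likewise an assumption.
    """
    return {
        "path": REPO_ID,
        "name": config_name,
        "split": split,
        "streaming": True,  # iterate lazily instead of downloading everything
    }


def stream_waxal(config_name: str, split: str = "train"):
    """Return a streaming dataset; requires `pip install datasets` and network access."""
    from datasets import load_dataset  # imported lazily so the helper above works offline

    return load_dataset(**waxal_load_kwargs(config_name, split))


# Example (not executed here; replace the placeholder with a real configuration name):
#   ds = stream_waxal("<config-name-from-dataset-card>")
#   first_row = next(iter(ds))
```

Streaming is used in this sketch because the corpus spans thousands of hours of audio, so a full download may be impractical for a first look.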

Key published figures from Google Research include:

  • 27 Sub-Saharan African languages in the initial corpus.
  • Approximately 1,846 hours of ASR audio and more than 565 hours of TTS audio.
  • A multi-year data collection effort beginning in 2021 with African institutions.
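Taken together, the published figures imply roughly 2,411 hours of audio in the initial release. The short sketch below works through that arithmetic. It treats "approximately 1,846" and "more than 565" as point values, and the per-language mean assumes an even spread across languages, which the announcement does not claim, so the results are rough estimates only.

```python
# Figures as published; the TTS figure is "more than 565", so totals are lower bounds.
ASR_HOURS = 1846
TTS_HOURS = 565
LANGUAGES = 27


def summarize(asr_hours: float, tts_hours: float, languages: int) -> dict:
    """Return total hours, the ASR/TTS split, and a naive per-language mean."""
    total = asr_hours + tts_hours
    return {
        "total_hours": total,
        "asr_share": asr_hours / total,
        "tts_share": tts_hours / total,
        # Illustrative only: actual hours per language are not given in the article.
        "mean_hours_per_language": total / languages,
    }


if __name__ == "__main__":
    for key, value in summarize(ASR_HOURS, TTS_HOURS, LANGUAGES).items():
        print(f"{key}: {value:.2f}")
```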

Key details: WAXAL-ASR and WAXAL-TTS

The WAXAL corpus combines two specialized components: WAXAL-ASR and WAXAL-TTS. WAXAL-ASR focuses on spontaneous speech for recognition, while WAXAL-TTS targets high-quality synthesis data for speech generation.

WAXAL-ASR contains around 1,846 hours of transcribed, unscripted speech. Participants described images drawn from Google's Open Images dataset, covering more than 50 topics, in their native languages. Google reports that this process captured tonal variation and code-switching common in daily communication.

WAXAL-TTS contributes over 565 hours of phonetically balanced recordings. Local contributors drafted scripts of 10,000 to 20,000 words and alternated reading and recording roles. Some participants used project funding to build custom recording boxes that supported controlled acoustic conditions.

Google states that combining unscripted ASR data with studio TTS audio is intended to support full-duplex conversational systems. The ASR component models varied speech input, while the TTS portion supplies clean reference audio for generated output.

Background, partners, and related African AI research

Google Research states that African academic and community organizations led all data collection activities for WAXAL. Each partner focused on specific language groups while following a shared methodology. Partner institutions retain ownership of the collected data while agreeing to open publication.

Named partners include Makerere University, the University of Ghana, Digital Umuganda, and Addis Ababa University. Media Trust, Loud n Clear, and the African Institute for Mathematical Sciences Senegal led many of the studio TTS recordings. Google experts advised on collection procedures and quality review.

The WAXAL framework has already supported several published studies on African speech technology. One project released a cookbook for collecting speech from people with impairments and created an open dataset for Akan speakers. Another study released a 5,000-hour speech corpus for five Ghanaian languages using controlled crowdsourcing.

Additional research has evaluated four widely used speech models (Whisper, XLS-R, MMS, and W2v-BERT) on 13 African languages using WAXAL-related resources. A separate literature review cataloged 74 datasets across 111 African languages and highlighted the value of character-based metrics such as Character Error Rate for complex languages.
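To make the metric concrete: Character Error Rate is commonly computed as the character-level edit distance between a reference transcript and a model's output, divided by the length of the reference. The sketch below is a minimal illustration under that standard definition, not the scoring code used in the cited studies. The NFC normalization step is an added assumption, included so that a letter with a tone mark compares equal whether it is encoded as one code point or as a base letter plus a combining mark.

```python
import unicodedata


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete ref_char
                    current[j - 1] + 1,                   # insert hyp_char
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length, after NFC normalization."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four reference characters.
    print(character_error_rate("abcd", "abxd"))
```

Because it is scored per character rather than per word, a single wrong tone mark or affix counts as one small error instead of invalidating an entire long word, which is one reason the metric may suit morphologically rich or tonal languages.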

Google indicates that WAXAL will continue to expand with additional languages over time. The organization presents the resource as part of a broader effort to narrow language gaps in digital technologies.

Author: Etavrian AI
Etavrian AI is developed by Andrii Daniv to produce and optimize content for the etavrian.com website.