Speech Datasets


We have compiled a list of datasets potentially relevant to your final project and highlight a few below. You can find a much more exhaustive collection here.

  • LibriSpeech (link) (paper): large-scale (1000 hours) corpus of read English speech

  • Multilingual LibriSpeech (link) (blog) (paper): A large multilingual corpus derived from LibriVox audiobooks.

  • HarperValleyBank (link) (paper): simulated contact center calls to Harper Valley Bank in the Gridspace Mixer platform. These task-oriented conversations are labelled with human transcripts, timing information, model outputs for emotion and dialog acts, subjective audio quality, task descriptions, and speaker identity.

  • Common Voice (link) (paper): 7,335 validated hours of speech in 60 languages. Each entry in the dataset consists of a unique MP3 and corresponding text file.

  • TED-LIUM (link) (paper): 452 hours of audio from TED talks.

  • AudioMNIST (link) (paper): recordings of people speaking the digits 0–9. Its small size makes it a great dataset for quick prototyping and development.

  • CHiME (link) (paper): The CHiME-Home dataset is a collection of annotated domestic environment audio recordings.

  • Google Speech Commands (link): 65,000 one-second long utterances of 30 short words, by thousands of different people.

  • Fluent Speech Commands (link): contains 30,043 utterances from 97 speakers, recorded as 16 kHz single-channel .wav files, each containing a single utterance used to control smart-home appliances or a virtual assistant.

  • AudioSet (link) (paper): 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.

  • Urban Sounds (link) (paper): This dataset contains 1302 labeled sound recordings. Each recording is labeled with the start and end times of sound events from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music.
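Several of the datasets above (e.g. Fluent Speech Commands) ship audio as 16 kHz single-channel .wav files. Before training on any of them, it is worth sanity-checking the sample rate, channel count, and duration of the files you download. The sketch below uses only the Python standard library `wave` module; the filename and the synthesized tone are stand-ins for a real dataset file.

```python
import math
import struct
import wave

# Hypothetical filename for illustration; a real dataset would provide
# many such files, one utterance per file.
PATH = "example_utterance.wav"
SAMPLE_RATE = 16_000  # Hz, the rate used by Fluent Speech Commands
DURATION_S = 1.0

# Synthesize a quiet 440 Hz tone as a stand-in for a real utterance.
n_samples = int(SAMPLE_RATE * DURATION_S)
samples = [
    int(0.3 * 32767 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE))
    for t in range(n_samples)
]

# Write a 16-bit PCM mono file at 16 kHz.
with wave.open(PATH, "wb") as f:
    f.setnchannels(1)             # single channel (mono)
    f.setsampwidth(2)             # 16-bit samples
    f.setframerate(SAMPLE_RATE)
    f.writeframes(struct.pack(f"<{n_samples}h", *samples))

# Inspect the file the way you might sanity-check downloaded dataset audio.
with wave.open(PATH, "rb") as f:
    channels = f.getnchannels()
    rate = f.getframerate()
    n = f.getnframes()
    print(channels, rate, n / rate)  # 1 16000 1.0
```

The same read-side check works on real dataset files; for MP3-based corpora such as Common Voice you would need a decoding library (e.g. torchaudio or librosa) instead of `wave`.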