To improve the quality of speaker embeddings, we have collected, and are releasing for academic use, the BookTubeSpeech dataset, which contains thousands of unique speakers. Audio samples in BookTubeSpeech are extracted from BookTube videos - YouTube videos in which people share their opinions on books. The dataset can be used for applications such as speaker verification, speaker recognition, and speaker diarization. In our ICASSP'20 paper, we showed that combining this dataset with VoxCeleb2 yields substantially better speaker embeddings for speaker verification on LibriSpeech than training on VoxCeleb2 alone.
To collect BookTubeSpeech automatically, we followed the pipeline shown above. We pruned our initial set of 38,707 BookTube videos down to 8,450 videos, each from a distinct speaker. The average file duration is 7.74 minutes, and most videos are shorter than 20 minutes; see the histogram below.
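If you obtain the audio, a quick way to reproduce these duration statistics is sketched below. It is a minimal, illustrative script, assuming a flat directory of .wav files (the directory name is a placeholder) and the third-party soundfile and matplotlib packages, which are not part of this release.

```python
# Sketch: compute per-file durations and plot a histogram.
# "booktubespeech_wav/" is a placeholder directory name, not part of the release.
from pathlib import Path

import matplotlib.pyplot as plt
import soundfile as sf

wav_dir = Path("booktubespeech_wav")
durations_min = []
for wav_path in sorted(wav_dir.glob("*.wav")):
    info = sf.info(str(wav_path))            # reads only the file header
    durations_min.append(info.duration / 60.0)

print(f"{len(durations_min)} files, "
      f"average duration {sum(durations_min) / len(durations_min):.2f} min")

plt.hist(durations_min, bins=50)
plt.xlabel("Duration (minutes)")
plt.ylabel("Number of videos")
plt.title("BookTubeSpeech video durations")
plt.show()
```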
Below we provide the URLs for the BookTubeSpeech videos -- both the entire set of videos we initially downloaded and the pruned version containing only distinct speakers.
| Name | Link |
| --- | --- |
| Pruned BookTubeSpeech YouTube IDs (8,450 videos, each belonging to a distinct speaker) | here |
| All BookTube URLs (38,707 videos in total; multiple videos may come from the same speaker) | here |
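To fetch the audio yourself from the pruned ID list, something like the following sketch works. It assumes the list is a plain-text file with one YouTube ID per line (the filename below is a placeholder) and that the yt-dlp command-line tool and ffmpeg are installed; note that some videos may have become unavailable since collection.

```python
# Sketch: download audio for each YouTube ID with the yt-dlp CLI.
# "pruned_ids.txt" is a placeholder name for the released ID list,
# assumed to contain one YouTube ID per line.
import subprocess
from pathlib import Path

out_dir = Path("booktubespeech_audio")
out_dir.mkdir(exist_ok=True)

with open("pruned_ids.txt") as f:
    video_ids = [line.strip() for line in f if line.strip()]

for vid in video_ids:
    url = f"https://www.youtube.com/watch?v={vid}"
    subprocess.run(
        [
            "yt-dlp",
            "-x",                      # extract audio only
            "--audio-format", "wav",   # convert to .wav (requires ffmpeg)
            "-o", str(out_dir / "%(id)s.%(ext)s"),
            url,
        ],
        check=False,                   # skip videos that are no longer available
    )
```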
The .wav files for the 8,450 pruned BookTubeSpeech videos are also available for download, but access must be explicitly granted; please email mnpham@wpi.edu to request it.
Please cite the following paper if you make use of the dataset.
This material is based on work supported by NSF Cyberlearning grants 1822768 and 1551594.