Easy spoken corpora with YouTube

I don’t think it’s exactly a secret that I rather like corpora. In this post I shall show you how to create an easy spoken corpus using YouTube and a subtitle downloader. Use this at your own risk; YouTube might disable this functionality at any time.

Find your videos.

Search YouTube. You know how to do this.

Download subtitles.

I used DownSub.com. It opens a pop-up ad the first time you paste the video address into the search box but is otherwise benign.
Download your subtitles. Repeat for as many videos as required. Yes, this is a pain in the bum, but it’s the best I can do.

Edit text.

Open all your subtitle files in the text editor of your choice and delete any nonsense and HTML codes (that is, replace them with nothing). Save the results as .txt files.
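If you have a lot of files, the find-and-replace can be scripted rather than done by hand. What follows is only a rough sketch in Python, and it rests on assumptions that are mine: that the subtitles were saved in the common SRT layout (a cue number, a timestamp line containing "-->", then the spoken text) and that they are UTF-8 encoded. The names clean_subtitles and convert_folder are made up for illustration.

```python
import html
import re
from pathlib import Path

# Anything that looks like an HTML tag, e.g. <i> or <font color="...">.
TAG_RE = re.compile(r"<[^>]+>")
# Start of an SRT timestamp line, e.g. "00:00:01,000 -->".
TIMESTAMP_RE = re.compile(r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->")


def clean_subtitles(raw: str) -> str:
    """Return only the spoken text from SRT-style subtitle content."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers and timestamp lines.
        # Caveat: a spoken line made up only of digits is dropped too.
        if not line or line.isdigit() or TIMESTAMP_RE.match(line):
            continue
        line = TAG_RE.sub("", line)         # remove HTML tags
        line = html.unescape(line).strip()  # turn &amp; into &, etc.
        if line:
            kept.append(line)
    return "\n".join(kept)


def convert_folder(folder: str) -> None:
    """Write a cleaned .txt file next to every .srt file in the folder."""
    for path in Path(folder).glob("*.srt"):
        text = clean_subtitles(path.read_text(encoding="utf-8"))
        path.with_suffix(".txt").write_text(text, encoding="utf-8")
```

Calling convert_folder("subtitles") would then leave one .txt file per video in a folder called subtitles (the folder name is just an example). Check a couple of the output files by eye before trusting the rest, since subtitle files vary.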

Wow, a corpus!

Or a small one, depending on how much time you have. Tag the corpus if you wish, using TagAnt by Laurence Anthony. You can open the corpus in his AntConc, too. Both are free downloads.
