I put up a new pre-print on SocArxiv:
Creating a small corpus to inform materials design in an ongoing English for Specialist Purposes (ESP) course for Orthodontists and Orthodontic Assistants
In my work as a language teacher to a group of orthodontists and orthodontic treatment assistants, I wanted an analysis of orthodontic practitioner-to-patient discourse. Because access to authentic spoken discourse was too difficult to obtain due to ethical considerations, a small corpus was constructed in order to facilitate better-informed form-focused instruction. Details of the typical forms found in the corpus are given, as is an overview of the corpus construction.
I don’t think it’s exactly a secret that I rather like corpora. In this post I shall show you how to create a simple spoken corpus using YouTube and a subtitle downloader. Use at your own risk: YouTube might disable this functionality at any time.
Find your videos.
Search YouTube. You know how to do this.
I used DownSub.com. It opens a pop-up ad the first time you paste the video address into the search box but is otherwise benign.
Download your subtitles. Repeat for as many videos as required. Yes this is a pain in the bum but it’s the best I can do.
Open all your subtitle files in the text editor of your choice and delete the timestamps, cue numbers and HTML codes. Save them as .txt files.
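If you have a lot of files, this clean-up step can be scripted instead of done by hand. A minimal sketch in Python, assuming the subtitles came down as .srt files in a folder called subs/ (the folder name and file pattern are just my example, not anything DownSub requires):

```python
import re
from pathlib import Path

def clean_srt(text):
    """Strip SRT cue numbers, timestamp lines and HTML-ish tags, keeping only the speech."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.isdigit() or "-->" in line:
            continue  # skip blanks, cue numbers, and 00:00:01,000 --> 00:00:04,000 lines
        kept.append(re.sub(r"<[^>]+>", "", line))  # drop <i>, <font color=...>, etc.
    return " ".join(kept)

subs = Path("subs")
if subs.is_dir():
    for srt in subs.glob("*.srt"):
        cleaned = clean_srt(srt.read_text(encoding="utf-8"))
        srt.with_suffix(".txt").write_text(cleaned, encoding="utf-8")
```

The result is one plain .txt file per video, ready to drop into AntConc.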
Wow, a corpus!
Or a small one, depending on how much time you have. Tag the corpus if you wish, using TagAnt by Laurence Anthony. You can open the corpus in AntConc, also by him. Free downloads.
Image from Night of the Living Dead, George A. Romero 1968 – no copyright.
This is an activity similar to one I did before, from this TBLT Task Ideas Linoit Board, where you get students to choose one thing from a set of limited options. For this lesson I chose eight films from the top 50 list on imdb.com and copied and pasted the story synopses into text files. You could get your students to choose; I didn’t because I was a bit short of time for various reasons. I then set my students, in groups of four, to choose one film to watch together.
I ran the text files through TagAnt to tag them for parts of speech, ready for the AntConc corpus concordancer. You’ll want the TagAnt tag list handy to check grammar in AntConc.
Open the tagged files in AntConc. Check the clusters, N-grams and word frequencies (including tags). In my mini corpus I found that the most salient grammar was the present simple passive, and there were also far more proper nouns than expected. I kept this in mind for Focus on Form, and I did indeed need to focus on form with passives.
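What AntConc is doing with those tags can be sketched in a few lines of Python. Here I count adjacent tag pairs in word_TAG formatted text; I’m using Penn Treebank-style tags for illustration, so treat the exact tag names as my assumption rather than TagAnt’s actual output:

```python
import re
from collections import Counter

def tag_bigrams(tagged_text):
    """Count adjacent POS-tag pairs in word_TAG formatted text."""
    tags = re.findall(r"_([A-Z]+)", tagged_text)
    return Counter(zip(tags, tags[1:]))

# Illustrative tagged sentence (Penn Treebank-style tags, my example)
tagged = "The_DT film_NN is_VBZ directed_VBN by_IN a_DT young_JJ Romero_NNP"
counts = tag_bigrams(tagged)
# A be-verb followed by a past participle, (VBZ, VBN), flags a likely passive.
print(counts[("VBZ", "VBN")])
```

Sorting the resulting counter by frequency is essentially the tag-cluster view that surfaced the passives in my mini corpus.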
Print the untagged text files after changing fonts and tidying them in your favourite word processor. Print, cut, and pin/stick to the wall.
I had my students, in groups of four, take rotating turns to read for: new vocabulary, storyline, setting, and characters. For odd numbers: groups of five with two assigned to new vocabulary, or groups of three with setting and characters both assigned to one student. They checked dictionaries halfway through the task and again at the end. They then read and chose which film to watch (or which trailer to watch as homework and write about in their learning journals).
A lot of my students chose a film because ‘it was the only film we understood and liked’, which is fine, in my opinion. I told them that I don’t choose to watch films whose story synopses I don’t understand. I had also borne in mind the number of proper nouns counted in the corpus, so I remembered to tell students who looked a bit stuck that if a difficult word was capitalised in the middle of a sentence it was probably a place or a person.
It wasn’t bad, but it wasn’t as good as I expected. Even with short texts, the lesson was a bit hard. Some pictures of the films probably would have been useful. Anyway, you live and learn, don’t you?
The heat got turned up this week and we looked at keywords, looked deeper at collocation and colligation and then semantic preference and discourse prosody.
As somebody always about on the internet reading blogs about teaching, I have got kind of sick of collocation being defined as ‘the company words keep’. It doesn’t really mean very much, and it could just as easily apply to colligation, too.
Collocate: words that occur together more often than chance would predict.
Colligate: words that have a statistical affinity for particular grammatical classes.
Then we get onto semantic preference. This is really interesting. These are basically collocate groups. For example, in a search of the BNC for ‘encounter’ as a noun, there are collocations with: this, first, last, final, after, second, before, every. I would say (and I could be wrong) that this forms a semantic group of chronological markers.
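The window-based collocate counting that a concordancer does behind the scenes can be sketched like this in Python (the sample sentence and the window size of four are my own choices, not BNC settings):

```python
from collections import Counter

def collocates(tokens, node, window=4):
    """Count words occurring within `window` tokens either side of each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            span = tokens[max(0, i - window):i + window + 1]
            counts.update(t for t in span if t != node)
    return counts

text = "our first encounter was brief but every encounter after this encounter mattered"
print(collocates(text.split(), "encounter").most_common(3))
```

A real analysis would also weight the raw counts with a statistic such as mutual information or log-likelihood, so that very common words like ‘was’ don’t swamp the list.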
What this means for teachers is that if you plan to teach ‘encounter’ as a noun, you should probably consider teaching it in contexts with these markers, probably as part of a story. This could be part of a prior reflection on words likely to arise as part of a task in a syllabus or, if you’re teaching using literature, that you might want to bring in some other materials if these collocates don’t appear (although there are also collocates with ‘casual’, ‘sexual’ and ‘thrilling’ for all you risque teachers).
Getting on to discourse prosody, which is the meanings and discourse usually associated with words: this has sociolinguistic implications, in that how a word is used within a corpus (especially one taken at a specific time/place) tells you about the cultural values associated with that word. Again, in the BNC, if one searches for ‘elderly’, it appears that the elderly need care, particularly health care, and that they are vulnerable, which is backed up by the collocations.
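Discourse prosody is normally spotted by reading concordance lines. A toy keyword-in-context (KWIC) display, of the kind AntConc or the BNC interface produces, might be sketched like this (the sample sentence is mine, not from the BNC):

```python
def kwic(tokens, node, context=5):
    """Return keyword-in-context lines for each occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append(f"{left:>30} | {tok} | {right}")
    return lines

sample = "many elderly residents need round-the-clock care and some elderly patients are vulnerable".split()
for line in kwic(sample, "elderly"):
    print(line)
```

Scanning down the aligned right-hand contexts is what lets you spot the recurring ‘care’ and ‘vulnerable’ framings.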
For teachers, this means you might look at this word (for recycling especially) when teaching lessons based on health, or talk about health and infirmity when talking about age (and whether these should be taken as a given, of course; always question discourse!)
I haven’t had much time to play with AntConc this week because I want to make a better corpus to mess about with. Still, interesting stuff.
I joined the Futurelearn/Lancaster University Corpus MOOC (Massive Open Online Course) this week to supplement the module on technology and corpus linguistics I’m studying for my MA.
So far, so good. I’ve managed to watch all of the video lectures and I’ve done a good deal of the reading. It’s just a bit of a dip of the toe in the water this week, but it was useful to read about different types of corpora as well as how to read the frequency data and so on (spoiler: think about the source material and how broad it is).
One thing that did come up that I wanted to reflect upon was something said in one of the lectures:
Corpora may be used by language teachers to check frequency of occurrence so they may decide to teach their learners more high-frequency items.
It sounds right, but then what about sequences of acquisition? Sure, single words, especially simple nouns or verbs, might be chosen, but could it be the case that some high-frequency structures are acquired later than less frequent ones? I think I have more reading to do!
The other day I had a lesson with my TOEIC class at one of the universities I teach at and we were having a vocabulary review. I decided to check knowledge of collocations by using some collocation forks and have my students check things out using COCA.
That part of the lesson worked well; after getting the students used to productive use of the corpus and reducing the number of lines, all was good. Checking items like ‘take in’, they found that it mainly collocates with visual or cognitive stimuli.
It might have looked like my students were having a faff about on their phones but if I don’t teach them how to use a corpus in lessons, they probably won’t be able to use it without guidance at home.
I have in the past used Twitter as a corpus with students, but it doesn’t always work well for producing concordances.
If you are interested in working with COCA, you should definitely check out Mura Nava’s Cup of COCA posts.