Text Analysis with AntConc for social media data: Keyword Lists and Keyness

The Keyword List tool measures which words are unusually frequent or infrequent in datasets or corpora compared to a reference/benchmark word list (or files).

This post is part of a series of tutorials about AntConc.

Uploading a Keyword List and analyzing keywords / keyness metric
1. As usual, the first step is to open your files/corpora and produce a word list. For this example, we are going to use a file with 16k tweets containing the term “Brazil”.

2. Generate the word list through the Word List tab.
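If you want to see what a word list amounts to outside AntConc, here is a minimal Python sketch. It is only an illustration: the tokenizer pattern, the `word_list` function and the two sample tweets are assumptions made for this example, and AntConc's own token definition may differ.

```python
import re
from collections import Counter

# Assumed token definition: an optional # or @ followed by word characters,
# optionally with an apostrophe part (e.g. "don't"). AntConc's settings may differ.
TOKEN_RE = re.compile(r"[#@]?\w+(?:'\w+)?")

def word_list(texts):
    """Return a Counter mapping each lowercased token to its frequency."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts

# Made-up sample tweets, for illustration only.
tweets = [
    "Brazil wins again! #Brazil",
    "Flying to Brazil next week",
]

freqs = word_list(tweets)

# Print the list sorted by frequency, similar to the Word List tab.
for word, freq in freqs.most_common():
    print(freq, word)
```

Hashtags and mentions are kept as separate tokens here, so "#brazil" and "brazil" are counted separately. Whether that is desirable depends on your research question.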

3. The Keyword List tool compares your files/corpora to one or more reference/benchmark files/corpora to highlight which words are unusually frequent or infrequent. In other words, it shows which words are keywords in your file(s)/corpora.

There are three main ways of using the Keyword List tool:

Comparing your file(s) to a general corpus representing a national language. Here you can use a reference word list produced from texts representing that language.
Comparing your file(s) to past texts or word lists. This option can be appropriate for social media data, especially to discover new information. For example, you can compare recent texts to a past corpus covering a 12-month period.
Comparing texts produced by different social media communities or texts reacting to different authors/pages (e.g. Facebook comments in different periods).
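To make the idea of "unusually frequent or infrequent" more concrete, the sketch below computes a keyness score for a single word using the log-likelihood statistic, which is widely used for keyword analysis in corpus linguistics. This is an assumption for illustration: the statistic and settings in your AntConc version may differ, and the function name and all the numbers are made up for this example.

```python
import math

def log_likelihood_keyness(freq_target, size_target, freq_ref, size_ref):
    """Signed log-likelihood (G2) keyness of one word.

    freq_target / freq_ref: occurrences of the word in each corpus.
    size_target / size_ref: total number of tokens in each corpus.
    The score is positive when the word is relatively more frequent in the
    target corpus and negative when it is relatively less frequent.
    """
    total = freq_target + freq_ref
    if total == 0:
        return 0.0
    # Expected frequencies if the word were equally common in both corpora.
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    g2 = 0.0
    if freq_target > 0:  # a zero frequency contributes nothing to the sum
        g2 += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    g2 *= 2
    overused = freq_target / size_target >= freq_ref / size_ref
    return g2 if overused else -g2

# Made-up counts: a word that is common in the tweets and rare in the reference...
frequent = log_likelihood_keyness(500, 100_000, 100, 1_000_000)
# ...and a word that is rare in the tweets but common in the reference.
infrequent = log_likelihood_keyness(5, 100_000, 1_000, 1_000_000)
print(frequent > 0, infrequent < 0)
```

The larger the absolute value of the score, the stronger the evidence that the word's frequency differs between the two corpora. A word with the same relative frequency in both corpora scores zero.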

4. For the sake of simplicity, we’ll run a comparison with a pre-produced reference word list, the BNC wordlist. To use it, download the wordlist from our wordlist folder (or from the BNC website), click on Use word list(s), then click on Add Files to add BNC_WRITTEN_wordlist.txt. Click on Load, then Apply.

5. Now you can discover which terms/words in your dataset are the most “important”, in the sense that they are unusually frequent or infrequent.

Most unusually frequent words (positive Keyness):

Unusually infrequent words (negative Keyness). Bear in mind that we are analyzing Twitter data, so the appearance of terms like however or although in this list could be related to tweet formats and length limitations:

So, this post concludes our series on AntConc. To learn more about AntConc, text analysis and corpus linguistics:

Tarcízio Silva
