Text Analysis with AntConc for social media data: Word Lists, Word Frequencies and File View

In the last post, we learned the basics of AntConc. Now we are going to show you how to use AntConc to generate word lists (with frequencies) and how to use the handy File View tool.

Don’t forget that this post is part of a series of tutorials:

Intro, Opening a File and Settings
Word Lists and File Viewer (we are here)
Concordancer and Concordance Plot
Clusters and N-Grams (soon)
Collocations (soon)

AntConc functions are accessed through seven tabs: Concordance, Concordance Plot, File View, Clusters/N-Grams, Collocates, Word List and Keyword List.

In this basic tutorial, we are going to follow the steps to produce simple word lists.

Remember to open your file and import the recommended settings for social media research (see the first tutorial in this series).

Generating and navigating a simple Word List

1. Open your file(s). In the examples below I’m going to use a dataset of 16k tweets in English containing the word ‘brazil’ (collected through Netlytic). Download the file brazil_tweets_16732tweets_2017_11_30.txt from our folder.

2. In the Word List tab, click on Start and wait a few seconds.

3. Now you can explore and navigate the data, scrolling down to find meaningful words and sorting by Frequency, Word (alphabetical order) or Word End.

4. You can search for a term in the box at the bottom and click on the button “Search Only”:

5. If you click on any word, you’ll be directed to the Concordance tab. You can also read a tutorial on the Concordance tool (soon).

6. And if you click on any word in the Concordance tool, you’ll be directed to the File View tool. It functions like a simple text reader, where you can see the full corpus.

7. To export the list, just go back to the Word List tab and click on File -> Save Output.

8. The output is a .txt file that looks like this:

9. Then you can open the output, or copy and paste it, in spreadsheet software such as Excel or LibreOffice for further analysis.
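If you want to sanity-check the counts outside AntConc, the idea behind a word list can be sketched in a few lines of Python. This is a minimal sketch, not AntConc's implementation: the token rule (lowercased runs of letters) and the function name `word_list` are assumptions made for illustration, so the numbers may differ from AntConc's depending on your token settings (hashtags and mentions, for instance, lose their `#` and `@` here).

```python
import re
from collections import Counter

# Assumed token definition: runs of Unicode letters (no digits, no underscore).
# AntConc's token definition is configurable, so its counts may differ.
TOKEN_RE = re.compile(r"[^\W\d_]+")

def word_list(text):
    """Return (word, frequency) pairs, most frequent first, ties in alphabetical order."""
    tokens = TOKEN_RE.findall(text.lower())
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Tiny made-up sample standing in for a file of tweets.
sample = "Brazil wins! Brazil is happy. Is Brazil happy?"
for rank, (word, freq) in enumerate(word_list(sample), start=1):
    print(rank, freq, word)
```

To run it on a real file, read the text first (for example with `open(path, encoding="utf-8").read()`) and pass it to `word_list`.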

Filtering out stopwords

Stopwords are words that you don’t want to count or visualize. Usually, they are the most common words without semantic or topical relevance for your research problem (such as articles, pronouns and some adverbs).

1. First, you need a stopword list! You can produce or edit a list yourself, but let’s start with an example list. You can download it from the lists folder.

2. To load a stopword list from a .txt file, go to Tool Preferences -> Word List. There you’ll see the option “Use a stoplist below” in the “Word List Range” section. Click on Open and select your .txt file.

If you have done it right, the words will show in the box:

Now you just click on “Apply”!

Go back to the Word List tab and click the “Start” button again. Compare the two word lists below. The first one is the original word list and the second one is the list with stopwords filtered out:
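The filtering step itself is simple to picture in code. The sketch below assumes a stoplist with one word per line and case-insensitive matching; both are assumptions for illustration, and `load_stoplist` and `filtered_counts` are hypothetical helper names, not part of AntConc.

```python
from collections import Counter

def load_stoplist(lines):
    """Build a stopword set from lines of text, one word per line (assumed format)."""
    return {line.strip().lower() for line in lines if line.strip()}

def filtered_counts(tokens, stopwords):
    """Count tokens, skipping any token whose lowercase form is in the stoplist."""
    return Counter(t for t in tokens if t.lower() not in stopwords)

# Made-up stoplist and tokens; a real stoplist would be read from the .txt file.
stopwords = load_stoplist(["the", "is", "a", ""])
tokens = ["the", "match", "is", "a", "Brazil", "match"]
print(filtered_counts(tokens, stopwords).most_common())
```

With a real stoplist, you could pass an open file object to `load_stoplist`, since iterating over a file yields its lines.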

Counting specific words

Another useful Word List option is to count only specific words that you already know or that you have just discovered in your corpora/datasets. Follow the steps below:

1. First, you’ll need a word list. In our case, we are going to load a list of names of Brazilian soccer teams like the one below:

2. Go to Tool Preferences -> Word List and open the file with the words (download it from the lists folder). Select “Use specific words below” and click Apply.

3. Go to the Word List tab and click on ‘Start’ to generate the list again. The result will be a list of only the desired words:

4. Export that list (through File -> Save Output) and you can produce a treemap like the visualization below with RAW Graphs:
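Counting only specific words is the mirror image of stopword filtering: instead of dropping listed words, you keep only them. Here is a minimal sketch under stated assumptions: the team names are illustrative and not the tutorial's actual list, `count_specific` is a hypothetical helper, and matching is done on single lowercase tokens (so multi-word names would need extra handling).

```python
from collections import Counter

def count_specific(tokens, wanted):
    """Count only the tokens that appear in the `wanted` list (case-insensitive)."""
    wanted_set = {w.strip().lower() for w in wanted if w.strip()}
    return Counter(t.lower() for t in tokens if t.lower() in wanted_set)

# Illustrative list of team names (an assumption, not the file from the lists folder).
teams = ["flamengo", "santos", "palmeiras"]
tokens = ["Flamengo", "beat", "Santos", "and", "flamengo", "celebrates"]

for word, freq in count_specific(tokens, teams).most_common():
    # Tab-separated rows are easy to paste into a spreadsheet or a charting tool.
    print(f"{word}\t{freq}")
```

Words from the list that never occur in the data are simply absent from the result, which is worth remembering when you compare categories in a chart.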

Counting Lemma Word Forms

This is an optional step, useful if you want to aggregate the inflected forms of a word. For example, the verb talk may appear as talking, talks, talked and so on. Lemmatization aggregates those inflected forms under talk.

With social media data, this is important for investigating variations of the same root meaning, such as autism, autistic and “autist” in a dataset collected with a search query about vaccines, for example.

Steps:

1. In AntConc, the first thing you’ll need is a lemma list. You can download it directly from the AntConc website or from our wordlists folder. The file will look like this:

Because the list is a plain text file, you can add or remove lines of lemma correspondences. As you can see above, that list doesn’t include the word autism. We could add the following line:

autism -> autistic, autistically, autist

(Even though the word ‘autist’ doesn’t exist, it could be added because it is a common error among Portuguese speakers, for example.)

2. To add a Lemma List you just need to go to Tool Preferences -> Word List and click Load in the Lemma List options.

After you select your file, AntConc will show you a preview. Click on “OK” and then “Apply”:

3. Now you can go back to the Word List tab and generate your list again. As you can see below, now AntConc counts Lemma Types and Lemma Tokens instead of Word Types and Word Tokens:

Lemmatization can greatly improve the precision of some claims about your corpora, because the count for a concept is no longer split across its inflected forms.
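To make the aggregation concrete, here is a minimal Python sketch of how a lemma list in the `lemma -> form1, form2` layout shown above can be applied to a list of tokens. It is an illustration under assumptions, not AntConc's code: `parse_lemma_lines` and `lemma_counts` are hypothetical helpers, matching is case-insensitive, and words missing from the list are counted as their own lemma.

```python
from collections import Counter

def parse_lemma_lines(lines):
    """Map each word form to its lemma, from lines like 'lemma -> form1, form2'."""
    form_to_lemma = {}
    for line in lines:
        if "->" not in line:
            continue  # skip blank or malformed lines
        lemma, forms = line.split("->", 1)
        lemma = lemma.strip().lower()
        if not lemma:
            continue
        form_to_lemma[lemma] = lemma  # the lemma counts as a form of itself
        for form in forms.split(","):
            form = form.strip().lower()
            if form:
                form_to_lemma[form] = lemma
    return form_to_lemma

def lemma_counts(tokens, form_to_lemma):
    """Count tokens by lemma; unlisted words fall back to their lowercase form."""
    return Counter(form_to_lemma.get(t.lower(), t.lower()) for t in tokens)

mapping = parse_lemma_lines(["autism -> autistic, autistically, autist"])
tokens = ["autism", "Autistic", "autist", "vaccine"]
print(lemma_counts(tokens, mapping))
```

In this made-up sample, the three autism-related forms collapse into a single entry, while "vaccine" stays on its own because it is not in the list.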

We hope the word list, word frequency, stopword filtering and lemmatization techniques will help you explore and analyze your social media datasets.

The next AntConc tutorial will focus on Concordancer and Concordance Plot (soon)!

Tarcízio Silva