Text Analysis with AntConc for social media data: Keyword Lists and Keyness

Published on 06/05/2019 by Tarcízio Silva

The Keyword List tool measures which words are unusually frequent or infrequent in datasets or corpora compared to a reference/benchmark word list (or files).

This post is part of a series of tutorials about AntConc:

Uploading a Keyword List and analyzing keywords / keyness metric
1. As usual, the first step is to open your files/corpora and produce a word list. For this example, we are going to use the file with 16k tweets containing the term ‘Brazil’ (download it).

2. Generate the word list through the Word List tab.

3. Basically, the Keyword List tool compares your files/corpora to one or more reference/benchmark files/corpora to highlight which words are unusually frequent or infrequent. In other words, it shows which words are keywords in your file(s)/corpora.

There are three main ways of using the Keyword List tool:

Comparing your file(s) to a general corpus representing a national language. Here you can use a reference word list produced from texts representing that language.
Comparing your file(s) to past texts or word lists. This option could be appropriate for social media data, especially to discover new information. For example, you can compare recent texts to a past corpus covering a 12-month period.
Comparing texts produced by different social media communities or texts reacting to different authors/pages (e.g. Facebook comments in different periods).

4. For the sake of simplicity, we’ll run a comparison with the pre-produced word list called the BNC wordlist. To use it, download the wordlist from our wordlist folder (or from the BNC website), click on Use word list(s) and click on Add Files to add the BNC_WRITTEN_wordlist.txt file. Click on Load, then Apply.

5. Now you can discover which terms/words in your dataset are the most “important”, in the sense that they are unusually frequent or infrequent.

Most unusually frequent words (positive Keyness):

Unusually infrequent words (negative Keyness). Bear in mind that we are analyzing Twitter data, so the presence of terms like however or although in this list could be related to tweet formats and limitations:
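To make the keyness score less of a black box, here is a minimal sketch in Python of how such a score can be computed. It assumes a log-likelihood statistic, which is one measure commonly used for keyness; the post does not say which measure AntConc was configured with, so treat this as an illustration and not as AntConc's exact calculation. The function name and the counts in the example are hypothetical.

```python
import math

def keyness(freq_target, size_target, freq_ref, size_ref):
    """Signed log-likelihood keyness of one word (a sketch, not AntConc's code).

    freq_target / freq_ref: how often the word occurs in each corpus.
    size_target / size_ref: total number of tokens in each corpus.
    """
    total = freq_target + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)

    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    ll *= 2

    # Positive keyness: relatively more frequent in the target corpus.
    overused = freq_target / size_target >= freq_ref / size_ref
    return ll if overused else -ll

# Hypothetical counts, for illustration only.
print(keyness(freq_target=50, size_target=1000, freq_ref=5, size_ref=1000))
print(keyness(freq_target=5, size_target=1000, freq_ref=50, size_ref=1000))
```

A word with the same relative frequency in both corpora scores zero; the further the score is from zero, the more unusual the word's frequency is in your files.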

This post concludes our series on AntConc. To learn more about AntConc, text analysis and corpus linguistics:

Text Analysis with AntConc for social media data: Clusters and N-Grams

Published on 02/05/2019 by Tarcízio Silva

In the last post, we learned how to explore and analyze concordances in AntConc. Now we are going to show you how to use AntConc to study clusters and n-grams, and also how to count hashtag frequencies and locate hashtags.

This post is part of a series of tutorials:

Intro, Opening a File and Settings
Word Lists and File Viewer
Concordancer and Concordance Plot
Clusters and N-Grams (we are here)
Collocations (soon)
Keyword Lists (soon)

Extracting N-Grams from the corpora

An n-gram is a contiguous sequence of items from a text. N-grams can be used to analyze phonemes, letters, syllables or words, for example. For our purposes, we are going to see the steps required to generate n-grams of words in a given text.
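If you want to see the idea outside AntConc, here is a minimal sketch in Python. It assumes very simple tokenization (lowercasing and splitting on whitespace), which is not how AntConc tokenizes; the function name and the sample sentence are made up for illustration.

```python
from collections import Counter

def word_ngrams(text, n_min=2, n_max=2):
    """Return all word n-grams with sizes from n_min to n_max."""
    tokens = text.lower().split()  # naive tokenization, an assumption
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Count the 2-grams in a made-up sentence.
counts = Counter(word_ngrams("ban plastic straws and ban plastic bags"))
print(counts.most_common(3))
```

Changing `n_min` and `n_max` plays the same role as the minimum and maximum N-Gram Size values in the tool.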

1. First, as usual, you need to open a file and generate a Word List. This time, we are going to use a corpus of thousands of tweets containing the term ‘plastic’. Download a corpus file with 19k tweets.

2. It’s important that you don’t forget to open the customized settings, because we’ll work with #hashtags and @usernames.

3. To use N-Grams, you need to go to the tab Clusters/N-Grams and check the option highlighted on the image:

4. Click on Start and here is the result:

5. You can change the minimum and maximum values of the N-Gram Size. For example, if you change them to 1 and 3, the result changes and now you can see sequences of one, two and three words:

6. To analyze specific clusters of words, you can search for them using the Search Term box.

7. Mind the options like Search Term Position. You can search for words before or after your terms. Searching for clusters with size 2 with the term plastic On Right, we get the following result.

8. Important: the Range column counts the number of corpus files in which that word or term is present. This number is especially interesting when you are comparing corpora.

Counting Hashtags

You can use the Clusters/N-Grams tab to count terms that have some structure, character or word in common. For example: #hashtags and @usernames in some social media platforms.

1. Open your file(s). Don’t forget to configure the settings as explained in the tutorial on files and settings.

2. Generate a Word List.

3. To generate a list of hashtags, for example, go to the Clusters/N-Grams tab and use the Search Term ‘#*’, with both the minimum and maximum Cluster Size set to 1.

Troubleshooting: counting hashtags didn’t work? AntConc is probably configured with the default settings, where the ‘#’ symbol is a wildcard. Check your settings!
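As a cross-check outside AntConc, the same count can be sketched in a few lines of Python. This is only an illustration: the regular expression below is an assumption about what counts as a hashtag (a ‘#’ followed by letters, digits or underscores), and the sample tweets are made up.

```python
import re
from collections import Counter

# Assumed hashtag pattern: '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")

def count_hashtags(tweets):
    """Count hashtags across an iterable of tweet texts, ignoring case."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(tweet))
    return counts

# Made-up tweets, for illustration only.
sample = ["Ban #Plastic now #ocean", "#plastic free living"]
print(count_hashtags(sample).most_common())
```

Swapping the pattern for `r"@\w+"` would count @usernames in the same way.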

Text Analysis with AntConc for social media data: Concordancer and Concordance Plot

Published on 28/04/2019 by Tarcízio Silva

In the last posts, we learned the basics of AntConc and how to generate and analyze word lists. Now we are going to show you how to use AntConc to study concordances by navigating instances of a keyword in context.

Don’t forget that this post is part of a series of tutorials:

Intro, Opening a File and Settings
Word Lists and File Viewer
Concordancer and Concordance Plot (we are here)
Clusters and N-Grams (soon)
Collocations (soon)

Concordance basics

1. Open your file(s). In this example, we are going to use only two texts: the English Wikipedia pages about the United States and the United Kingdom. You can find them in our datasets folder.

2. Generate a Word List.

3. Click on the Concordance tab. Now you can search for a letter, word or expression. We searched for the term ‘war’ and, as you can see, we got a total of 107 Concordance Hits. This is the total number of results for that search across the two files.

4. By default, the search term will be highlighted in blue and up to three terms on the right will be highlighted in red, green and purple. You can change those colors in Global Settings.

These highlights are useful for understanding and sorting the words in the vicinity of our term (or terms) of interest.

5. At the bottom, you can change the KWIC (Keyword in Context) settings and choose which items you want to highlight and sort in the Concordance tab. You can select up to three levels. In the following images, 1R stands for “1 Right”, the first item on the right, 1L stands for “1 Left”, the first item on the left, and so on.

Changing the levels to 1L, 2L and 3L will result in something like the following:

World War

as * as
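The KWIC view and the 1R sort described in the steps above can be sketched in Python as follows. This is a simplified illustration, not AntConc's implementation: it assumes whitespace tokenization, strips a few punctuation marks before matching, and uses made-up function names and sample text.

```python
def kwic(text, term, window=3):
    """Return (left context, keyword, right context) for each hit of term."""
    tokens = text.split()  # naive tokenization, an assumption
    rows = []
    for i, tok in enumerate(tokens):
        # Ignore case and surrounding punctuation when matching.
        if tok.lower().strip('.,;:!?"()') == term.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            rows.append((left, tok, right))
    return rows

def sort_by_1r(rows):
    """Sort hits by the first word on the right (1R)."""
    return sorted(rows, key=lambda row: row[2][0].lower() if row[2] else "")

# Made-up sentence, for illustration only.
text = "The civil war ended in 1865 and the world war began later"
for left, keyword, right in sort_by_1r(kwic(text, "war", window=2)):
    print(" ".join(left).rjust(15), keyword.upper(), " ".join(right))
```

Sorting by 1L would work the same way, using the last word of the left context as the sort key.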

Concordance plot

The concordance plot is a straightforward way to compare two or more corpora, their concordance hits and the distribution of words or phrases across them.

1. It is very simple to use the Concordance Plot. Just generate a Word List (if you haven’t already), search for a word or phrase on the Concordance tab and click on the Concordance Plot tab. Let’s use that same example from before, searching for ‘war’:

2. As we can see, the word ‘war’ is much more common on the United States Wikipedia page. What does that mean? It could be an indication of greater interest in war among the editors of the USA Wikipedia page, for example.

3. You can increase the zoom on the visualization in the option Plot Zoom:

4. You can observe in Step 2 that, even though the number of characters in each file is different (126k x 135k), the rectangles representing the files have the same width. This is because the visualization is normalized to facilitate some comparisons. You can change that by going to Tool Preferences -> Concordance Plot and selecting the option “Use a relative length” in the Plot Length Options.

Now you can see a width that better represents the length and the difference between the files:
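To see what a normalized plot means, here is a small sketch in Python. It is an illustration under simple assumptions (whitespace tokenization, positions measured in tokens and not characters), with hypothetical function names; AntConc's own plot is drawn differently.

```python
def hit_positions(text, term):
    """Relative position (0.0 to 1.0) of each hit of term in the text."""
    tokens = text.lower().split()  # naive tokenization, an assumption
    if not tokens:
        return []
    return [i / len(tokens) for i, tok in enumerate(tokens) if tok == term.lower()]

def plot_line(positions, width=40):
    """Draw a text version of the plot: '|' marks a hit, '-' is empty space."""
    cells = ["-"] * width
    for p in positions:
        cells[min(int(p * width), width - 1)] = "|"
    return "".join(cells)

# Two made-up texts of different lengths share the same plot width.
short_text = "war and peace and war"
long_text = "a long story about peace with one war near the very end of it"
print(plot_line(hit_positions(short_text, "war")))
print(plot_line(hit_positions(long_text, "war")))
```

Because each position is divided by the length of its own file, every file fills the same width, which is the normalization described in step 4.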

Searching for a list of words

1. In the Concordance tool, as well as in other AntConc tools, we can search for a list of words at once. Those words can be entered manually or loaded from a .txt file.

2. In the Advanced Search options you can check the option “Use search term(s) from list below” and input the words. In the following example, we used the expressions below to also include terms like “economics”, “economically”, “finance”, “financial” etc.

3. Alternatively, you can click on “Load File” and upload words from a .txt file.

4. With this technique you can look and explore concordances for words with semantic similarity:

5. Finally, another option in the Advanced Search specifications is to use Context Words and Horizons. The search results are filtered according to their proximity to other words. In the example below, combining terms related to war and conflict with a filter for instances near ‘crisis’ and ‘depression’ could be used to understand the relations between war and economic problems:
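The logic behind context words and horizons can be sketched in Python like this. It is a simplified illustration, not AntConc's implementation: the horizon is assumed to be the same number of tokens on both sides, and the function name and sample sentence are made up.

```python
def filter_by_context(tokens, search_terms, context_words, horizon=5):
    """Return the positions of search terms that have a context word nearby.

    horizon: how many tokens to look at on each side (assumed symmetric).
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok in search_terms:
            window = tokens[max(0, i - horizon):i] + tokens[i + 1:i + 1 + horizon]
            if any(word in context_words for word in window):
                hits.append(i)
    return hits

# Made-up sentence, for illustration only.
tokens = ("the war deepened the economic crisis while peace talks "
          "stalled after the conflict").split()
print(filter_by_context(tokens, {"war", "conflict"}, {"crisis", "depression"}))
```

Widening the horizon lets more distant context words count, so more hits pass the filter.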

Text Analysis with AntConc for social media data: Word Lists, Word Frequencies and File View

Published on 23/04/2019 by Tarcízio Silva

In the last post, we learned the basics of AntConc. Now we are going to show you how to use AntConc to generate word lists (and frequencies) and how to use the File Viewer.

Don’t forget that this post is part of a series of tutorials:

Intro, Opening a File and Settings
Word Lists and File Viewer (we are here)
Concordancer and Concordance Plot
Clusters and N-Grams (soon)
Collocations (soon)

AntConc functions are accessed through the seven tabs below:

In this basic tutorial, we are going to follow the steps to produce simple word lists.

Remember to open your file and import the recommended settings for social media research [tutorial].

Generating and navigating in a simple Word List

1. Open your file(s). In the examples below I’m going to use a dataset with 16k tweets in English containing the word ‘brazil’ (collected through Netlytic). Download the file brazil_tweets_16732tweets_2017_11_30.txt from our folder.

2. In the Word List tab, click on Start and wait a few seconds

3. Now you can explore and navigate the data, scrolling down to find meaningful words and sorting by Frequency, Word (alphabetical order) or Word end.

4. You can search for a term in the box at the bottom and click on the button “Search Only”:

5. If you click on any word, you’ll be directed to the Concordance tab. You can also read a tutorial on the Concordance tool (soon)

6. And if you click on any word in the Concordance tool, you’ll be directed to the File View tool. It functions like a simple text reader, where you can see the full corpus.

7. To export the list, just go back to the Word List tab and click on File -> Save Output.

8. The output is a .txt file that looks like this:

9. Then you can open the output, or copy and paste it, in spreadsheet software like Excel or LibreOffice for further analysis.
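For readers who also work in code, the steps above (count words, rank them, export for a spreadsheet) can be approximated with a short Python sketch. The token pattern, the column names and the sample tweets are assumptions for illustration; the CSV layout is not AntConc's own output format.

```python
import csv
import io
import re
from collections import Counter

# Assumed token pattern: words, optionally starting with '#' or '@'.
TOKEN = re.compile(r"[#@]?\w+")

def word_list(texts):
    """Return (word, frequency) pairs, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN.findall(text.lower()))
    return counts.most_common()

def to_csv(rows):
    """Render the word list as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["rank", "frequency", "word"])
    for rank, (word, freq) in enumerate(rows, start=1):
        writer.writerow([rank, freq, word])
    return buffer.getvalue()

# Made-up tweets, for illustration only.
print(to_csv(word_list(["Brazil wins", "brazil #WorldCup"])))
```

Writing the returned text to a file with a .csv extension is enough to open it in Excel or LibreOffice.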

Filtering out stopwords

Stopwords are words that you don’t want to count or visualize. Usually, they are the most common words without semantic or topical relevance for your research problem (such as articles, pronouns and some adverbs).

1. First, you need a stopword list! You can produce or edit a list yourself, but let’s start with an example list. You can download it from the lists folder.

2. To upload a stopword list from a .txt file, go to Tool Preferences -> Word List. There you’ll see the option “Use a stoplist below” in the “Word List Range” section. Click on Open and select your .txt file.

If you have done it right, the words will show in the box:

Now you just click on “Apply”!

Go back to the Word List tab and click the “Start” button again. Compare the two word lists below. The first one is the original word list and the second one is the list with stopwords filtered out:
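The effect of a stoplist can be sketched in Python as follows. This is an illustration only: it assumes the stoplist file holds one word per line, and the function names, the tiny stoplist and the counts are made up.

```python
def parse_stoplist(text):
    """Read a stoplist with one word per line (an assumed file layout)."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

def remove_stopwords(frequencies, stopwords):
    """Drop every word of the frequency table that is in the stoplist."""
    return {word: count for word, count in frequencies.items()
            if word.lower() not in stopwords}

# Made-up stoplist and counts, for illustration only.
stopwords = parse_stoplist("the\nof\nand\n")
frequencies = {"the": 120, "brazil": 45, "of": 80, "football": 12}
print(remove_stopwords(frequencies, stopwords))
```

Only the words that carry topical meaning remain, which is the difference between the two lists compared above.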

Counting specific words

Another useful Word List option is to count only specific words that you already know or that you just discovered in your corpora/datasets. Follow the steps below:

1. First, you’ll need a word list. In our case, we are going to upload a list of names of Brazilian soccer teams like the one below:

2. Go to Tool Preferences -> Word List and load the words from the file (download it from the lists folder). Click on “Use specific words below” and Apply.

3. Go to the Word List tab and click on ‘Start’ to generate the list again. The result will be a list of only the desired words:

4. Export that list (through File -> Save Output) and you can produce a treemap like the one below with RAW Graphs:

Counting Lemma Word Forms

This is an optional step, useful if you want to aggregate the inflected forms of a word. For example, the verb talk may appear as talking, talks, talked and so on. Lemmatization aggregates those inflected forms under talk.

On social media data, this is important for investigating variations of the same root, such as autism, autistic and “autist” in data collected with a search query about vaccines, for example.

Steps:

1. In AntConc, the first thing you’ll need is a lemma list. You can download it directly from the AntConc website or from our wordlists folder. The file will look like this:

That means you can add or remove lines of lemma correspondences. As you can see above, that list doesn’t include the word autism. We could add the following line:

autism -> autistic, autistically, autist

(even though the word ‘autist’ doesn’t exist, it could be added because it is a common error among Portuguese speakers, for example)

2. To add a Lemma List you just need to go to Tool Preferences -> Word List and click Load in the Lemma List options.

After you select your file, AntConc will show you a preview. Click on “OK” and then “Apply”:

3. Now you can go back to the Word List tab and generate your list again. As you can see below, now AntConc counts Lemma Types and Lemma Tokens instead of Word Types and Word Tokens:

Lemmatization can greatly improve the precision of some claims about your corpora.
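To show what the lemma list does to the counts, here is a minimal sketch in Python. It reads lines in the "lemma -> form, form" layout shown above and adds each form's count to its lemma. The function names and the sample tokens are made up, and the matching is a simple lowercase lookup, which may differ from how AntConc applies the list.

```python
from collections import Counter

def parse_lemma_list(lines):
    """Map each word form to its lemma from 'lemma -> form1, form2' lines."""
    form_to_lemma = {}
    for line in lines:
        if "->" not in line:
            continue  # skip blank or malformed lines
        lemma, forms = line.split("->", 1)
        lemma = lemma.strip().lower()
        form_to_lemma[lemma] = lemma
        for form in forms.split(","):
            form = form.strip().lower()
            if form:
                form_to_lemma[form] = lemma
    return form_to_lemma

def lemma_counts(tokens, form_to_lemma):
    """Count tokens by lemma; words missing from the list count as themselves."""
    return Counter(form_to_lemma.get(tok.lower(), tok.lower()) for tok in tokens)

mapping = parse_lemma_list(["autism -> autistic, autistically, autist"])
# Made-up tokens, for illustration only.
print(lemma_counts(["Autism", "autistic", "autist", "vaccine"], mapping))
```

The three variants are counted under the single lemma autism, which is the aggregation AntConc reports as Lemma Types and Lemma Tokens.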

We hope these word list, word frequency, stopword filtering and lemmatization techniques will help you explore and analyze your social media datasets.

The next AntConc tutorial will focus on Concordancer and Concordance Plot (soon)!