Text Analysis with AntConc for social media data: Keyword Lists and Keyness

The Keyword List tool measures which words are unusually frequent or infrequent in datasets or corpora compared to a reference/benchmark word list (or files).

This post is part of a series of tutorials about AntConc:

  1. Intro, Opening a File and Settings
  2. Word Lists and File Viewer 
  3. Concordancer and Concordance Plot 
  4. Clusters and N-Grams
  5. Collocations 
  6. Keyword Lists

 

Uploading a Keyword List and analyzing keywords / keyness metric
1. As usual, the first step is to open your files/corpora and produce a word list. For this example, we are going to use a file with 16k tweets containing the term ‘Brazil’ (download it).

2. Generate the word list through the Word List tab.

 

3. Basically, the Keyword List tool compares your files/corpora to one or more reference/benchmark files/corpora to highlight which words are unusually frequent or infrequent. In other words, it shows which words are keywords in your file(s)/corpora.
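
If you want to see the idea behind this comparison outside AntConc, here is a minimal sketch. It assumes the log-likelihood statistic, which is one common keyness measure; AntConc's own settings may use a different statistic or threshold. The function name keyness and the toy counts are made up for illustration, and both corpora are assumed to be non-empty.

```python
import math
from collections import Counter

def keyness(target: Counter, reference: Counter) -> dict:
    """Signed log-likelihood keyness for every word in the target corpus."""
    n_target = sum(target.values())
    n_reference = sum(reference.values())
    scores = {}
    for word, a in target.items():
        b = reference.get(word, 0)
        # Expected frequencies if the word were equally common in both corpora.
        e1 = n_target * (a + b) / (n_target + n_reference)
        e2 = n_reference * (a + b) / (n_target + n_reference)
        ll = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0.0))
        # Positive sign: unusually frequent; negative sign: unusually infrequent.
        sign = 1 if a / n_target >= b / n_reference else -1
        scores[word] = sign * ll
    return scores

# Toy word counts (hypothetical): tweets about Brazil vs. a general reference list.
target = Counter({"brazil": 50, "the": 100, "however": 1})
reference = Counter({"brazil": 5, "the": 1000, "however": 50})
scores = keyness(target, reference)
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}\t{score:.2f}")
```

Words at the top of the printed list are candidates for positive keyness and words at the bottom for negative keyness. Note that this sketch only scores words that occur in the target corpus.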

There are three main ways of using the Keyword List tool:

  • Comparing your file(s) to a general corpus representing a national language. Here you can use a reference word list produced from texts representing that language.
  • Comparing your file(s) to past texts or word lists. This option could be appropriate for social media data, especially to discover new information. For example, you can compare recent texts to a past corpus covering a 12-month period.
  • Comparing texts produced by different social media communities or texts reacting to different authors/pages (e.g. Facebook comments in different periods).

4. For the sake of simplicity, we’ll run a comparison with the pre-produced BNC word list. To use it, download the word list from our wordlist folder (or from the BNC website), click on Use word list(s) and click on Add Files to add the BNC_WRITTEN_wordlist.txt file. Click on Load and then Apply.

 

5. Now you can discover which terms/words in your dataset are the most “important”, in the sense that they are unusually frequent or infrequent.

Most unusually frequent words (positive Keyness):

 

Unusually infrequent words (negative Keyness). Bear in mind that we’re analyzing Twitter data, so the presence of terms like however or although in this list could be related to tweet formats and limitations:

 

So, this post concludes our series on AntConc. To learn more about AntConc, text analysis and corpus linguistics:

Text Analysis with AntConc for social media data: Collocations

Collocation refers to how words regularly occur together in texts/corpora. Searching for the collocates of a specific term can point to other words and expressions that are important in the documents.

This post is part of a series of tutorials:

  1. Intro, Opening a File and Settings
  2. Word Lists and File Viewer 
  3. Concordancer and Concordance Plot 
  4. Clusters and N-Grams
  5. Collocations (we are here)
  6. Keyword Lists (soon)

 

Exploring Collocation

1. As usual, open a file or set of files (don’t forget to configure the settings). In this tutorial, we are going to use the file plastic_19k_tweets_june_2018.txt available in our datasets folder.

2. Generate a Word List.

3. Go to the Collocates tab and search for a term like ‘plastic’. The following list ranks the most relevant collocates:

As you can see, most of the collocates are related to “plastic surgery”, not the material plastic.

 

4. A frequent problem is the listing of words which appear only once or a few times in the file(s). To filter them out, you can increase the Minimum Collocate Frequency:

 

5. The words will be searched in a Window Span to count the co-occurrences in the vicinity of the search term. You can increase or decrease the span on the left and on the right of the search term.
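
To make the Window Span and the Minimum Collocate Frequency more concrete, here is a minimal sketch in Python. It only counts raw co-occurrences and does not reproduce any statistical ranking that AntConc applies. The function name collocates, its parameters and the toy tweets are assumptions made for this illustration.

```python
import re
from collections import Counter

def collocates(texts, term, left=5, right=5, min_freq=1):
    """Count words found within the window span around each hit of `term`."""
    counts = Counter()
    for text in texts:
        # Keep '#' and '@' inside tokens so hashtags and usernames survive.
        tokens = re.findall(r"[#@\w]+", text.lower())
        for i, token in enumerate(tokens):
            if token != term:
                continue
            window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
            counts.update(window)
    # Drop collocates below the minimum frequency.
    return {word: n for word, n in counts.items() if n >= min_freq}

tweets = [
    "She had plastic surgery last year",
    "Plastic surgery is expensive",
    "Ban plastic straws now",
]
print(collocates(tweets, "plastic", left=1, right=1, min_freq=2))
```

With a span of one word on each side and a minimum frequency of 2, only ‘surgery’ survives in this toy example, which mirrors the effect of the two options.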

 

6. Since Twitter texts are very short, we recommend decreasing the span. The results might be more precise, as in the following example:

With these results, you can explore the collocates to try to understand and locate meaningful words related to your keywords of interest.

Text Analysis with AntConc for social media data: Clusters and N-Grams

In the last post, we learned how to explore and analyze concordances in AntConc. Now we are going to show you how to use AntConc to study clusters and n-grams, and also how to count hashtag frequencies and locate hashtags.

This post is part of a series of tutorials:

  1. Intro, Opening a File and Settings
  2. Word Lists and File Viewer 
  3. Concordancer and Concordance Plot 
  4. Clusters and N-Grams (we are here)
  5. Collocations (soon)
  6. Keyword Lists (soon)

 

Extracting N-Grams from the corpora

An n-gram is a contiguous sequence of items from a text. N-grams can be used to analyze phonemes, letters, syllables or words, for example. For our purposes, we are going to see the steps required to generate n-grams of words from a given text.
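
As a quick illustration of what a word n-gram is, the sketch below counts contiguous word sequences in a sentence. It is not how AntConc is implemented; the function name word_ngrams and the simple whitespace tokenization are assumptions made for this example.

```python
from collections import Counter

def word_ngrams(text, n_min=2, n_max=2):
    """Count contiguous word sequences of every size from n_min to n_max."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of n words over the text.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = word_ngrams("plastic surgery is not plastic pollution", n_min=1, n_max=2)
print(counts.most_common(3))
```

Setting n_min and n_max to different values returns sequences of every size in between, in the spirit of the N-Gram Size options.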

1. First, as usual, you need to open a file and generate a Word List. This time, we are going to use a corpus of tweets containing the term ‘plastic’. Download a corpus file with 19k tweets.

2. It’s important that you don’t forget to open the customized settings, because we’ll work with #hashtags and @usernames.

3. To use N-Grams, go to the Clusters/N-Grams tab and check the option highlighted in the image:

 

4. Click on Start and here is the result:

 

5. You can change the minimum and maximum values of the N-Gram Size. For example, if you change them to 1 and 3, the result changes and now you can see sequences of one to three words:

 

6. To analyze specific clusters of words, you can search for them using the Search Term box.

 

7. Mind options like Search Term Position. You can search for words before or after your term. Searching for clusters of size 2 with the term plastic On Right, we get the following result:
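
The effect of the Search Term Position can be sketched as follows. This example assumes that “On Left” means the term is the first word of the cluster and “On Right” means it is the last word; the function name clusters and the toy sentence are hypothetical.

```python
from collections import Counter

def clusters(tokens, term, size=2, position="left"):
    """Count clusters of `size` words with `term` as first (left) or last (right) word."""
    counts = Counter()
    for i in range(len(tokens) - size + 1):
        chunk = tokens[i:i + size]
        # Pick the word that must match the search term.
        anchor = chunk[0] if position == "left" else chunk[-1]
        if anchor == term:
            counts[" ".join(chunk)] += 1
    return counts

tokens = "no more plastic bags and no single use plastic".split()
print(clusters(tokens, "plastic", size=2, position="right"))
print(clusters(tokens, "plastic", size=2, position="left"))
```

With the term on the right you see what comes before ‘plastic’; with the term on the left you see what comes after it.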

 

8. Important: the Range column counts the number of corpus files in which the word or term is present. This number is especially interesting when you are comparing corpora.
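
The difference between frequency and Range is easy to show with a small sketch. The file names and texts below are invented, and term_range is a hypothetical helper, not an AntConc function.

```python
def term_range(files, term):
    """Return (frequency, range): total hits and number of files containing `term`."""
    per_file = {name: text.lower().split().count(term) for name, text in files.items()}
    frequency = sum(per_file.values())
    # Range counts files with at least one hit, regardless of how many hits.
    file_range = sum(1 for hits in per_file.values() if hits > 0)
    return frequency, file_range

files = {
    "january.txt": "plastic bags plastic straws",
    "february.txt": "paper bags",
    "march.txt": "plastic bottles",
}
print(term_range(files, "plastic"))
```

A term with a high frequency but a Range of 1 is concentrated in a single file, which is worth checking before drawing conclusions about the whole corpus.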

 

Counting Hashtags

You can use the Clusters/N-Grams tab to count terms that have some structure, character or word in common. For example: #hashtags and @usernames in some social media platforms.

1. Open your file(s). Don’t forget to configure the settings as explained in the tutorial on files and settings.

2. Generate a Word List.

3. To generate a list of hashtags, for example, go to the Clusters/N-Grams tab and use the Search Term ‘#*’, with the minimum and maximum Cluster Size both set to 1.
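
If you want to double-check the hashtag counts outside AntConc, a regular expression does a similar job. This is a minimal sketch: count_hashtags and the toy tweets are made up, and the pattern treats a hashtag as ‘#’ followed by letters, digits or underscores, which may differ from your token settings.

```python
import re
from collections import Counter

def count_hashtags(texts):
    """Count every token that starts with '#', ignoring case."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in re.findall(r"#\w+", text))
    return counts

tweets = ["Ban #plastic now #Environment", "#Plastic is everywhere", "no tags here"]
print(count_hashtags(tweets).most_common())
```

Replacing ‘#’ with ‘@’ in the pattern gives the equivalent count for usernames.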

 

Troubleshooting: counting hashtags didn’t work? AntConc is probably still configured with the default settings, where the ‘#’ symbol is a wildcard. Check the settings!

Text Analysis with AntConc for social media data: Concordancer and Concordance Plot

In the last posts, we learned the basics of AntConc and how to generate and analyze word lists. Now we are going to show you how to use AntConc to study concordances, navigating the instances of a keyword in context.

Don’t forget that this post is part of a series of tutorials:

  1. Intro, Opening a File and Settings
  2. Word Lists and File Viewer 
  3. Concordancer and Concordance Plot (we are here)
  4. Clusters and N-Grams (soon)
  5. Collocations (soon)

 

Concordance basics

1. Open your file(s). In this example, we are going to use a simple corpus of only two texts: the English Wikipedia pages about the United States and the United Kingdom. You can find them in our datasets folder.

 

2. Generate a Word List.

 

3. Click on the Concordance tab. Now you can search for a letter, word or expression. We searched for the term ‘war’ and, as you can see, we got a total of 107 Concordance Hits. This is the total number of results for that search across the two files.

 

4. By default, the search term will be highlighted in blue and up to three terms on the right will be highlighted in red, green and purple. You can change those colors in Global Settings.

They are useful to understand and sort the words in the vicinity of our term (or terms) of interest.

 

5. At the bottom, you can change the KWIC (Keyword in Context) settings and choose which items you want to highlight and sort in the Concordance tab. You can select up to three levels. In the following images, 1R stands for “1 Right”, or the first item on the right, 1L stands for “1 Left”, or the first item on the left, and so on.
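
The sorting levels can be illustrated with a small keyword-in-context sketch. It is only an approximation of what the Concordance tab does: the function name kwic, the width parameter and the toy sentence are assumptions made for this example.

```python
def kwic(tokens, term, width=3, sort_by=("1R", "2R", "3R")):
    """Return (left, hit, right) context rows sorted by positions such as 1R or 2L."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == term:
            rows.append((tokens[max(0, i - width):i], token, tokens[i + 1:i + 1 + width]))

    def word_at(row, position):
        left, _, right = row
        offset, side = int(position[:-1]), position[-1]
        # 1L is the nearest word on the left, so the left context is reversed.
        context = right if side == "R" else left[::-1]
        return context[offset - 1].lower() if offset <= len(context) else ""

    return sorted(rows, key=lambda row: tuple(word_at(row, p) for p in sort_by))

tokens = "the war ended but the Civil War began and World War II followed".split()
for left, hit, right in kwic(tokens, "war", sort_by=("1L", "2L", "3L")):
    print(f"{' '.join(left):>20} | {hit} | {' '.join(right)}")
```

Sorting by 1L groups the hits by the word immediately before the term, while sorting by 1R groups them by the word immediately after it.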

 

Changing the levels to 1L, 2L and 3L will result in something like the following:

 

World War

 

as * as

 

Concordance plot

The concordance plot is a straightforward way to compare two or more corpora, their concordance hits and the distribution of words or phrases along them.


1. It is very simple to use the Concordance Plot. Just generate a Word List (if you haven’t already), search for a word or phrase on the Concordance tab and click on the Concordance Plot tab. Let’s use that same example from before, searching for ‘war’:

 

2. As we can see, the word ‘war’ is much more common on the United States Wikipedia page. What does that mean? It could be an indication of more interest in war among the editors of the USA Wikipedia page, for example.

 

3. You can increase the zoom on the visualization in the option Plot Zoom:

 

4. You can observe in Step 2 that, although the number of characters in each file is different (126k vs. 135k), the rectangles representing the files have the same width. This is because the visualization is normalized to facilitate some comparisons. You can change that by going to Tool Preferences -> Concordance Plot and selecting the option “Use a relative length” in the Plot Length Options.
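
The difference between the normalized plot and the relative length option can be sketched with a few lines of Python. This is an illustration under simple assumptions, not AntConc's actual drawing code: plot_positions is a hypothetical helper, it matches plain substrings (so ‘war’ would also match ‘warm’), and the two toy texts stand in for the real files.

```python
def plot_positions(files, term, relative_length=False):
    """Map each hit of `term` to a horizontal position between 0 and 1 per file."""
    longest = max(len(text) for text in files.values())
    plot = {}
    for name, text in files.items():
        lowered, needle = text.lower(), term.lower()
        # With relative length, every bar is scaled against the longest file.
        scale = longest if relative_length else len(text)
        hits, start = [], lowered.find(needle)
        while start != -1:
            hits.append(round(start / scale, 2))
            start = lowered.find(needle, start + 1)
        plot[name] = hits
    return plot

files = {
    "short.txt": "peace and war",
    "long.txt": "peace, then war, then peace again, then war",
}
print(plot_positions(files, "war"))
print(plot_positions(files, "war", relative_length=True))
```

In the normalized version every file spans the full width, so a hit near the end of a short file sits near the right edge. With relative length, the same hit moves to the left because the short file only occupies part of the width.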

Now you can see a width that better represents the length and the difference between the files:

 

Searching for a list of words

1. In the Concordance tool, as well as in other AntConc tools, we can search for a list of words at once. Those words can be entered manually or loaded from a .txt file.

 

2. In the Advanced Search options you can check the option “Use search term(s) from list below” and input the words. In the following example, we used the expressions below to also include terms like “economics”, “economically”, “finance”, “financial”, etc.
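
To see how a short list of wildcard expressions can cover many word forms, here is a minimal sketch. The expressions ‘econom*’ and ‘financ*’ are assumptions for illustration (the actual list appears only in the screenshot), and matching_tokens is a hypothetical helper that uses shell-style wildcards from Python's fnmatch module rather than AntConc's own wildcard settings.

```python
import fnmatch
import re

def matching_tokens(text, terms):
    """Return the tokens of `text` that match any term in the list ('*' is a wildcard)."""
    tokens = re.findall(r"\w+", text.lower())
    return [
        token for token in tokens
        if any(fnmatch.fnmatchcase(token, term.lower()) for term in terms)
    ]

sentence = "The economic crisis hurt financial markets and the economy"
print(matching_tokens(sentence, ["econom*", "financ*"]))
```

Two expressions are enough here to catch ‘economic’, ‘economy’ and ‘financial’ in one search.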

 

3. Alternatively, you can click on “Load File” and upload words from a .txt file.

 

4. With this technique you can explore concordances for semantically similar words:

 

5. Finally, another option in the Advanced Search settings is to use Context Words and Horizons. The hits for the search terms are filtered according to their proximity to other words. In the example below, terms related to war and conflicts are combined with a filter that keeps only the instances next to ‘crisis’ and ‘depression’, which could be used to understand the relations between war and economic problems:
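
The logic of Context Words and Horizons can be approximated like this. It is a rough sketch under the assumption that the horizon is measured in tokens on both sides of the hit; hits_near_context and the toy sentence are hypothetical.

```python
def hits_near_context(tokens, terms, context_words, horizon=5):
    """Keep hit positions of `terms` that have a context word within `horizon` tokens."""
    lowered = [token.lower() for token in tokens]
    kept = []
    for i, token in enumerate(lowered):
        if token not in terms:
            continue
        # Look at the tokens on both sides of the hit, up to the horizon.
        window = lowered[max(0, i - horizon):i] + lowered[i + 1:i + 1 + horizon]
        if any(word in context_words for word in window):
            kept.append(i)
    return kept

tokens = "the war deepened the crisis while many years later another war was forgotten".split()
print(hits_near_context(tokens, {"war", "conflict"}, {"crisis", "depression"}, horizon=4))
```

Only the first ‘war’ is kept with a horizon of 4, because the second one has no context word nearby. A wider horizon lets more hits through.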

 

 

Text Analysis with AntConc for social media data: intro, files and settings

Broadly defined, (computational) text analysis is a set of techniques for automated content analysis. Even without the use of complex statistics or computational analysis, social science researchers can improve their data exploration with techniques involving word counting, co-occurrence and collocations.

AntConc is one of the easiest-to-use and most useful tools for text analysis and corpus linguistics. It was developed by Laurence Anthony, Professor in the Faculty of Science and Engineering at Waseda University, Japan. He maintains dozens of tools on his website, like TagAnt and FireAnt.

After this intro on AntConc, we are going to see the following posts covering its main functionalities:

  1. Intro, Opening a File and Settings (we are here)
  2. Word Lists and File Viewer
  3. Concordancer and Concordance Plot
  4. Clusters and N-Grams (soon)
  5. Collocations (soon)

 

The following matrix was proposed in the paper Computational text analysis for social science: Model assumptions and complexity. It summarizes the possibilities along two axes: simple versus complex statistics/computation, and weaker versus stronger domain assumptions. Using AntConc to analyze social media textual data encompasses simple statistics/computation tasks such as word counting, but it can also be applied to dictionary-based word counting by topic experts.

To understand and compare approaches from computer-aided content analysis, computer-aided interpretive textual analysis and corpus linguistics, I recommend the paper Taming textual data: The contribution of corpus linguistics to computer-aided text analysis.

AntConc will allow you to apply the main techniques of corpus linguistics, such as Word Frequencies, Collocation, Concordance, N-Grams and Corpora Comparison, to any kind of text.

But first things first! Download AntConc and read the following text, which will teach you the basics about the settings and how to open a file.

 

How to collect social media textual data?

There are dozens of social media research tools that allow you to extract or monitor textual data on the main platforms. Most of them collect data through keyword/hashtag search and/or from specific pages and websites. Most of the following tools use UTF-8 encoding to export files in .csv format. You can open them with Excel or LibreOffice, copy and paste the desired texts into a text editor and save them as a .txt file.
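
If copy-pasting gets tedious, the same conversion can be scripted. The sketch below assumes a UTF-8 .csv export whose text column is named ‘text’; that column name, the function name csv_column_to_txt and the file names in the usage note are assumptions, so adjust them to your own export.

```python
import csv

def csv_column_to_txt(csv_path, txt_path, column="text"):
    """Write one column of a UTF-8 .csv export to a .txt file, one text per line."""
    with open(csv_path, newline="", encoding="utf-8") as source, \
         open(txt_path, "w", encoding="utf-8") as target:
        written = 0
        for row in csv.DictReader(source):
            # Flatten line breaks inside a post so each text stays on one line.
            text = (row.get(column) or "").replace("\n", " ").strip()
            if text:
                target.write(text + "\n")
                written += 1
    return written
```

For example, csv_column_to_txt("tweets.csv", "tweets.txt") would return the number of texts written, and the resulting .txt file can then be opened in AntConc.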

Repositories/curated lists of tools:

If you are entirely new to analyzing social media textual data, I strongly recommend that you try the awesome and user-friendly tool Netlytic and collect some tweets or YouTube comments. But don’t worry: I’m going to give you some datasets in the following posts.

 

File Formats

AntConc can read several text formats: .txt, .html, .xml, .ant. The simplest one is the .txt file.

File formats and their descriptions:

  • .txt: the simplest format for storing text. Software like Notepad, Notepad++, TextMate, Word and most word processors can save your files as .txt.
  • .html: the standard format for saving web pages. You can save a webpage and open it in AntConc. AntConc has some settings to ignore text between the characters ‘<’ and ‘>’ used in HTML files.
  • .xml: Extensible Markup Language. It is similar to an .html document, but uses custom tags to define objects and the data within each object. In corpus linguistics/text analysis, it is frequently used to mark each word with its word category in Part-of-Speech Tagging.
  • .ant: a file format used by AntConc, interchangeable with .txt. It only saves the data on the current screen as an output.
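
The idea of ignoring text between ‘<’ and ‘>’ can be illustrated with a short sketch. It is a simplification that works for plain markup but is not a full HTML parser; strip_tags is a hypothetical helper and is not part of AntConc.

```python
import html
import re

def strip_tags(markup):
    """Drop everything between '<' and '>' and collapse the leftover whitespace."""
    text = re.sub(r"<[^>]*>", " ", markup)
    # Turn entities such as &amp; back into plain characters.
    return " ".join(html.unescape(text).split())

print(strip_tags("<p>Brazil &amp; <b>plastic</b> waste</p>"))
```

This can be handy if you prefer to clean a saved webpage into a .txt file before opening it.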

 

  1. Encoding

It is recommended that you save your text files with UTF-8 encoding. A character encoding is a standard for how to represent characters and symbols. UTF-8 is defined by the Unicode Standard, which covers the characters used in virtually all of the world's languages and scripts. Because of that, several data collection tools use UTF-8 encoding as a standard. So, remember to save your files in UTF-8 encoding!
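
If a file was saved in another encoding, it can be re-saved as UTF-8 with a few lines of Python. This is a minimal sketch: convert_to_utf8 is a hypothetical helper, and the default source encoding (latin-1) is only an assumption, so you need to know or check the real encoding of your file.

```python
def convert_to_utf8(source_path, target_path, source_encoding="latin-1"):
    """Re-save a text file as UTF-8; `source_encoding` must match the original file."""
    with open(source_path, encoding=source_encoding) as source:
        content = source.read()
    with open(target_path, "w", encoding="utf-8") as target:
        target.write(content)
```

If accented characters look broken after the conversion, the source encoding was probably guessed wrong.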

 

Optimal Settings for Social Media Texts

  1. Pre-configured settings

AntConc was not developed just for social media data but to analyse all sorts of texts, mainly literature, natural language and language corpora. So, it requires some adjustments to the software settings.

The specifications are listed below, but instead of following each step, you could just import a Settings File with the recommended definitions. Download the file antconc_settings_for_social_media.ant and, on AntConc, go to File -> Import Settings from File…, select and open the file:

 

That’s it! Now AntConc can be more useful for social media analysis. You can skip the following settings description if you have already imported the file.

 

2. Global Settings – Token Definition

In this section, we explain the recommended settings. Remember: you don’t need to follow these steps if you have just uploaded the pre-configured settings file provided above.

First, we configure the token settings. A token is an element of the text (word, character, punctuation mark, symbol, etc.). In the Token Definition Settings, you can define which characters/symbols AntConc will consider when counting and processing your text data.

The default settings are the following:

Default:

But, when we are working with social media data, there are some special characters used by social media users which represent specific conversation and affiliation practices. Two of them are very important:

The ‘@’ symbol: for Twitter users, the [at] symbol is used to mark user profiles. So, it is important to append the ‘@’ symbol. This will allow us, for example, to count the most mentioned Twitter users or, on the contrary, to filter out the usernames.

The ‘#’ symbol, in its turn, is a type of metadata used on most social media platforms to define hashtags. You need to append ‘#’ in AntConc token definitions to properly count hashtags.

Recommended:

So, we just need to go to Global Settings -> Token Definition, check the box “Append Following Definition” and include the signs ‘#’ and ‘@’.
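
The effect of appending ‘#’ and ‘@’ to the token definition can be imitated with two regular expressions. This is only an approximation, assuming that the default token definition accepts letters only; the names LETTERS_ONLY, LETTERS_PLUS and tokenize are made up for this example.

```python
import re

LETTERS_ONLY = re.compile(r"[^\W\d_]+")            # runs of letters
LETTERS_PLUS = re.compile(r"(?:[^\W\d_]|[#@])+")   # letters plus '#' and '@'

def tokenize(text, pattern):
    """Split `text` into tokens according to the given token definition."""
    return pattern.findall(text)

tweet = "@maria loves #Brazil"
print(tokenize(tweet, LETTERS_ONLY))
print(tokenize(tweet, LETTERS_PLUS))
```

With the letters-only definition the username and the hashtag lose their signs and get counted as ordinary words; with the extended definition they are kept as ‘@maria’ and ‘#Brazil’.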

3. Wildcard definitions

A wildcard is a character that stands in for a character, word or symbol during a query. AntConc has seven different wildcards. Below we can see the default settings (Global Settings -> Wildcards).

The problem is that two of these wildcards are assigned to very important signs in social media data: ‘#’ and ‘@’. This means that AntConc “ignores” these two signs in the results, because they are reserved as wildcards.

So, we recommend changing these two wildcards to other signs. In the example below, we changed them to ‘{’ and ‘}’.

 

 

Opening your File or Corpora

  1. Opening your File(s)

To open a file or a set of files in AntConc you just need to go to File -> Open File(s)… or File -> Open Dir.

With the option File -> Open File(s) you can select one or more files.

If you open two or more files, AntConc will apply your queries and analyses on all of them at once.

This is very useful for managing datasets/corpora. For example, you could be analysing a year of data and save the texts (comments, posts, tweets) for each month in a different file. Opening the 12 files at once allows you to compare things like the counts of specific words in the Concordance Plot or the Range of clusters/n-grams across files.

 

Now we can talk about counting word frequencies. See you in the next post: Word Lists, Word Frequencies and File Viewer