Week 9: Sample Analyses: Counts & Time Series

In this session we focus on querying the data contained in our database to extract information allowing us a series of typical analyses often performed with Twitter data. This week, we will focus on two analytical approaches–counts and time series. Next session, we will discuss network analysis. In the tutorial we describe these analytical approaches in detail and list exemplary studies illustrating these approaches on pages 42-79.

We will query the database from Python using a series of predefined commands. As before, we use peewee to communicate with our SQLite database from Python. Make sure to examine the workings of these commands in detail as discussed in session 7.

After exporting the summary statistics ready for analysis, we load them into R to perform a series of typical analyses. You find introductory readings on using R, exploratory data analysis in R, plotting data in R, time series analysis, and network analysis in the background readings.

The example scripts provided are specified to work with an example dataset collected by us during on the Republican Primary debates in the autumn of 2015. You can download a replication dataset through Twitter’s “hydrate” function following the instructions in Jürgens & Jungherr, p. 42. Of course you can adapt our commands provided in the files “example.py” and “database.py” according to your interest. Still, presently they are optimized to working with our sample dataset.

Code Examples:

Counts

In this session, we focus on how to extract summary statistics from our database allowing a series of standard analytical approaches to Twitter data.

First, let’s point the command line to your working directory for this project

cd "/Users/(...)/twitterresearch"

Now, let’s get some data. If you are directly participating in the course you will be provided with a data file allowing for a shared analysis. If you are following this course without actually participating in it physically, have a look at the tutorial pp. 43f.

After saving the data file in your working directory under the name tweets.db we are ready to load the data into our databse as discussed during our last session.

ipython
run database

Now, let’s count some entities!

Tweet.select().count()
User.select().count()

OK, now let’s export a ranked list of the accounts most often mentioned in the tweets in our database. For this, we need a function defined in our file examples.py.

import examples
examples.export_mention_totals()

Check your working directory. Now, you’ll find a new file in your working directory called mention_totals.csv. In the file you find a list of 50 accounts most often mentioned in the the tweets collected in the database.

Or maybe you are interested in the most often retweeted accounts:

examples.export_retweet_totals()

After focusing on users, let’s have look at dominating objects:

Hashtags:

examples.export_hashtag_totals()

Retweets:

examples.export_retweet_text()

URLs:

examples.export_url_totals()

How does this look over time?

Time Series

While already counting entities can provide you with interesting research topics (see the tutorial for a more detailed discussion), examining the development of entities over time provides even richer source material (see tutorial).

Here, we show you how you can extract data documenting temporal trends in the appearance of specific

Twitter entitities

Let’s start by exporting the daily message count during the week’s worth of messages in our example data set:

examples.export_total_counts()

Now you should find a new .csv file in your working directory listing the total message count for each day covered in our data set. To work with this and other output files load the into R:

Leave Python for now and start R Studio.

If you haven’t done so already, now it is time toa install a small selection of necessary packages:

install.packages(c("ggplot2","scales"))

After installing them in R make sure you load them to your workspace:

library(ggplot2)
library(scales)

As with Python point R to your working directory:

setwd(".../twitterresearch")

Now, we have to load the data exported before into R objects that allow analysis and plotting. As a first step load the complete file into a data frame:

message_counts_df<-data.frame(read.csv("total_counts.csv"))

Now extrate the column containing date information in a date format,…

dates<-as.POSIXct(message_counts_df[,1])
dates<-as.POSIXct(total_counts[,1])

…load the column with daily message counts,…

all_messages<-message_counts_df[,2]
all_messages<-total_counts[,2]

…and combine the both in a new data frame ready for plotting:

plot_all_messages_df<-data.frame(dates,all_messages)

Now we are ready to plot the data. For plotting we use the R pacakge ggplot2 by Hadley Wickham. This package supports you tremendously in the creation of simple and very complicated plots. It’s notation might take a little getting use to but it is definitely worth your time if you aim to keep working with data.

Now we load the data frame into ggplot and specify our prefered layout.

plot_all_messages<-ggplot(plot_all_messages_df,aes(x=dates,all_messages))+
         geom_line(stat="identity") +
         geom_point(size=2)+
         theme_bw()+
         xlab("")+
         theme(axis.text.x=element_text(angle=45,hjust=1))+
         ylab("Message Volume, Daily")

plot_all_messages

ggsave(file="Message Volume Daily.pdf", width = 170, height = 90, unit="mm", dpi=300)
dev.off()

For the detailed workings of this command see introductory books by Wickham or Chang.

Make sure to check out the tutorial and our scripts in greater detail as to see what types time series exports we have already implemented. If these sample commands do not cover your interests it’s time to get to work for and adapt our code according to your interests.

In this session, we only had time for the most cursory of glances at exporting data and using them in analyses. But not to worry, in the tutorial we cover this step of the research process with digital trace data in much greater detail. There, we discuss potential research designs, list exemplary studies illustrating potential analytical approaches, and provide detailed code examples in Python and R. So make sure you check out pages 42-79 of the tutorial and go through the examples given there step by step.

Required Readings:

Jürgens, P. & Jungherr, A. (2016). A tutorial for using twitter-data in the social sciences: Data collection, preparation, and analysis. Social Science Research Network (SSRN). doi:10.2139/ssrn.2710146. (pp. 42-67).

Background Readings:

Box-Steffensmeier, J. M., Freeman, J. R., Hitt, M. P., & Pevehouse, J. C. W. (2014). Time series analysis for the social sciences. New York, NY: Cambridge University Press.
Chang, W. (2018). R graphics cookbook (2nd ed.). Sebastopol, CA: O’Reilly Media.
Kabacoff, R. I. (2015). R in action: Data analysis and graphics with r (2nd ed.). Shelter Island, NY: Manning Publications Co.
Wickham, H. (2016). Ggplot2: Elegant graphics for data analysis (2nd ed.). New York: Springer.
Wickham, H. & Grolemund, G. (2017). R for data science. Sebastopol, CA: O’Reilly Media.

Week 8—Week 10

back

Using Digital Trace Data in the Social Sciences, University of Konstanz (Summer 2018)

Instructor: Andreas Jungherr