Our Methodology behind “Three Categories of News: Bread and Butter Issues, Major Headlines, and Third Strand Stories”
Discovering topics in Tweets is not an easy task for humans. According to Brandwatch, the platform we used for our analysis, there are more than 500 million Tweets per day or more than 6,000 Tweets every second. Given this abundance of data, we took some steps to make it easier to track the topics in the Tweets that we studied for this report.
1. Using Brandwatch data, we gathered Tweets that referred to Donald Trump and Joe Biden in the various forms in which their names appear (e.g., Joe Biden, Biden, @JoeBiden, Trump, Donald Trump, President Trump, @realDonaldTrump). We also included Tweets with hashtags like #Democrats2020 and #DonaldJTrump2020.
2. We studied millions of Tweets posted from January through April 2020, using a random sample of one percent of the retrieved Tweets. Monthly totals for Tweets were 584,000 (January), 558,000 (February), 810,000 (March), and 712,000 (April).
3. We cleaned the data to remove duplicate Tweets. This left us with 276,699 (January), 272,290 (February), 403,187 (March), and 360,023 (April) Tweets. We also removed user names and special characters like ‘$’ and ‘#’ from the Tweets.
4. Sometimes, Tweets are written and posted by computer programs called bots. For this analysis, we did not have a method to separate Tweets written by bots from Tweets written by humans, so our data includes all cleaned Tweets from both categories.
5. We used machine-based methods called topic models to spot the topics in Tweets. There are many versions of topic models. Our version defines a “topic” as the set of words that appear together as pairs most often in the Tweets.
6. Topic models depend on people to tell them how many topics to spot. Researchers draw on their experience to set, somewhat arbitrarily, the number of topics that the computer system should attempt to identify. Setting the number too low produces topics that are exceptionally broad. Setting the number too high produces topics with a very narrow focus. We went in steps of 3, starting from a small set of 5 topics and moving up to 20 topics (that is, 5, 8, 11, 14, 17, and 20 topics).
7. The topics that we identified are, in a sense, the top ones. “Top” refers to the pairs of words that appear together in more Tweets than any other possible pairs of words. “All possible pairs” is defined as the set of all pairs of words that can be formed from all the words in all the analyzed Tweets. There were many additional topics that didn’t rise to the level of the Top 20.
8. After reviewing the Top 20 topics, we settled on 14 topics so that we did not end up with either too many narrow topics or too few broad ones. Even then, some of the topics can be difficult to pin down. The topics labeled ‘???’ in the charts are examples of these.
9. Each topic has its own bar chart in the collage of bar charts for each month. Each bar corresponds to a word, and the length of the bar shows the probability that the word belongs to that topic.
10. The labels for the topics in our analysis are assigned by the researcher using subjective judgment. Each label is based on the collection of words that the topic model places in the topic and on the context of news items where those words may have appeared during the month. The researcher then makes an overall assessment of what the topic could be.
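
For readers who want to see the mechanics, the cleaning in step 3, the topic-count grid in step 6, and the "top pairs" idea in step 7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for this report: the sample Tweets are made up, the function names (`dedupe`, `clean_tweet`, `top_word_pairs`) are hypothetical, and we assume that two words "appear together" when they occur in the same Tweet. The sketch counts word pairs directly; it does not reproduce the topic model itself, which goes further and groups words into topics with probabilities.

```python
import re
from collections import Counter
from itertools import combinations


def dedupe(tweets):
    """Drop exact duplicate Tweets, keeping the first copy of each (step 3)."""
    return list(dict.fromkeys(tweets))


def clean_tweet(text):
    """Remove user names and special characters such as '$' and '#' (step 3)."""
    text = re.sub(r"@\w+", " ", text)            # user names like @someone
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # anything not a letter, digit, or space
    return " ".join(text.lower().split())        # lowercase and collapse whitespace


def top_word_pairs(tweets, n=20):
    """Count how many Tweets each unordered word pair appears in (step 7)."""
    counts = Counter()
    for tweet in tweets:
        words = sorted(set(tweet.split()))       # each word counted once per Tweet
        counts.update(combinations(words, 2))    # every pair of distinct words
    return counts.most_common(n)


# Made-up Tweets, for illustration only.
raw_tweets = [
    "@user1 Biden leads the #primary polls",
    "@user1 Biden leads the #primary polls",     # exact duplicate
    "Trump rally tonight! #DonaldJTrump2020",
    "Biden leads in new polls $$$",
]

unique = dedupe(raw_tweets)
cleaned = [clean_tweet(t) for t in unique]

# Step 6: candidate numbers of topics, in steps of 3 from 5 up to 20.
candidate_topic_counts = list(range(5, 21, 3))   # 5, 8, 11, 14, 17, 20

for pair, tweet_count in top_word_pairs(cleaned, n=3):
    print(pair, tweet_count)
```

Counting each word once per Tweet means a pair's score is the number of Tweets that contain both words, which matches the wording in step 7. A real topic model would be run once for each value in `candidate_topic_counts`, and the researcher would then compare the resulting topics as described in step 8.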