Using Twitter Data for Research

Issues of Data Retrieval, Management, and Processing

Rob Chew, Annice Kim, Paul Ruddle, Clay Heaton

RTI International

The Social Media Landscape

Opinion has never been this cheap, timely, or abundant

Harder to acquire data from traditional survey methods
- falling response rates
- inadequate sample frames

Huge opportunity for survey researchers to harness insight with social media

Huge opportunity for social scientists / survey researchers to understand what people are thinking, discover how information spreads throughout social networks, and track the evolution of important topics / events over time.
With the continuing trend of falling response rates and inadequate sampling frames, the potential for harnessing insight that’s cheap and timely without burdening respondents is an enticing proposition.

Twitter is one of the most important social networks because of its usage and data access
Twitter users are only 18% of internet users and 14% of the overall adult population
Their demographic profile is not reflective of the full population
Nonetheless, Twitter data has been found to correlate with flu outbreaks, movie box office sales, and opinions about politics and the economy

Twitter Data Access

Three main ways to get Twitter data:

Social Media Monitoring Firms
Twitter Authorized Resellers
Twitter API

Social Media Monitoring

(Radian6, Crimson Hexagon, etc.)

Ideal for tracking brands/issues in real-time, used by large companies and PR firms to address press in a timely fashion.
Has built-in tools to help handle volume, workflow, and reporting.
Offers in-house technical support for questions.
Often an expensive subscription model, but covers other social media outlets as well
As with most “out of the box” tools, analyses are limited by the scope of the tools made available
Methodology not always transparent
Usually limited data range (3 months)
May or may not have “Firehose” access

Twitter Authorized Resellers

(GNIP, Datasift)

Allows access to the Twitter “Firehose”, the totality of tweets on Twitter
More of a data vendor (DaaS) than a social media platform
Have access to historic data
Offers in-house technical support for questions.
Often expensive
Often requires infrastructure to analyze the information yourself

Twitter API

Application Programming Interface
An interface that allows you to directly connect and interact with an application (in our case, Twitter), instead of needing to read a webpage like a human does.
Can be thought of as a “menu” that shows the different data the applications can return.
Allows access to Twitter data directly
Provides access to both historic REST API data and Streaming “1%” API
Free!
API limits on the number of tweets/user profiles that you are allowed to access at one time.
Requires some technical/programming knowledge
No support (other than the internet and Twitter documentation)
Explorer Tool

Twitter API: REST vs. Streaming

REST API - To get information, user must specifically request it
- Well over 50 different REST API "Resources"
Streaming API - Once request is made, provides continuous stream of updates without further input from user (up to 1% of the full Twitter stream)
- Has "Public", "User", and "Site" streams

REST (Representational State Transfer)
- “Pull” strategy for data retrieval
- To get information, user must specifically request it
- GET / POST
Streaming
- “Push” strategy for data retrieval
- Once request made, provides continuous stream of updates w/o further input from user
- Three end point types:
  - Public - public tweets on twitter
  - User - single user stream
  - Site - multi-user streams intended for apps which access Tweets from multiple users

Twitter Data

Information about a user
- 18,000 profiles per 15 mins
A user’s network consisting of his connections (Friends, Followers)
- 7500 user ids per 15 mins
Tweets published by a user
- Up to 36,000 tweets per 15 mins
- Also available from Streaming API
Search results on Twitter
- Up to 18,000 tweets per 15 mins
- Also available from Streaming API
Location of Tweets
- Location for users/tweets embedded in objects (~1% complete)
- Can also search tweets for a specific area

Users
- Users can be anyone or anything. They tweet, follow, have a timeline, can be mentioned, and can be looked up in bulk.
- Attributes: User name, twitter handle, location, profile description, follower/friends/tweet counts, verified, profile creation date
Tweets
- Tweets are the basic atomic building block of all things Twitter. Tweets are also known more generically as “status updates.”
Entities
- Entities provide metadata and additional contextual information about content posted on Twitter. Entities are never divorced from the content they describe. In API v1.1, entities are returned wherever tweets are found in the API.
- Hashtags, URLs, media, user mentions
Place
- Places are specific, named locations with corresponding geo coordinates. They can be attached to Tweets by specifying a place_id when tweeting. Tweets associated with places are not necessarily issued from that location but could also potentially be about that location. Places can be searched for.
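
As a minimal sketch of how these objects nest inside one another, the snippet below pulls hashtags, mentions, and URLs out of a tweet that has already been parsed into a Python dictionary. The field names follow the v1.1 tweet JSON (entities, hashtags, user_mentions, urls); the sample tweet and the helper name summarize_entities are made up for illustration.

```python
def summarize_entities(tweet):
    """Return the hashtags, mentioned handles, and expanded URLs in a tweet dict."""
    entities = tweet.get("entities", {})
    return {
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
    }


# Hypothetical tweet, trimmed to the fields used above.
sample_tweet = {
    "text": "Talking #surveys with @example_user http://t.co/abc",
    "user": {"screen_name": "some_researcher", "followers_count": 120},
    "entities": {
        "hashtags": [{"text": "surveys", "indices": [8, 16]}],
        "user_mentions": [{"screen_name": "example_user", "indices": [22, 35]}],
        "urls": [{"url": "http://t.co/abc",
                  "expanded_url": "http://example.com/talk"}],
    },
}

summary = summarize_entities(sample_tweet)
print(summary)
```

Because entities travel with the tweet, there is no need to re-parse the tweet text to find hashtags or links.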

Many API calls available depending on what you’re interested in (how many?)
https://dev.twitter.com/overview/documentation

Most common types of info accessible from the API:
- Information about a user (REST)
- A user’s network consisting of his connections (REST)
- Tweets published by a user (REST, Streaming)
- Search results on Twitter (REST, Streaming)
- Location of Tweets

Information about a user (REST)

Twitter Object: User
Input: List of usernames (user_id or handle)
180 API calls per single user / 15 minutes
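
The 18,000 profiles per window quoted earlier is consistent with 180 calls that each return up to 100 profiles. The sketch below batches a list of user ids on that assumption; the 100-per-call cap is an assumption about the bulk lookup call, and the helper name batch and the ids are invented for the example.

```python
def batch(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


PROFILES_PER_CALL = 100   # assumed cap for one bulk user lookup call
CALLS_PER_WINDOW = 180    # from the limit quoted above

user_ids = list(range(1, 251))            # 250 made-up user ids
batches = batch(user_ids, PROFILES_PER_CALL)

print(len(batches))                           # API calls needed for 250 users
print(PROFILES_PER_CALL * CALLS_PER_WINDOW)   # profiles per 15-minute window
```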

A user’s network consisting of his connections (REST)

Followers
- Twitter Object: User
- Input: List of usernames (user_id or handle)
- 15 API calls per single user / 15 minutes
Friends
- Twitter Object: User
- Input: List of usernames (user_id or handle)
- 15 API calls per single user / 15 minutes

Tweets published by a user (REST, Streaming)

REST
- Twitter Object: Tweets
- Input: List of usernames (user_id or handle)
- 180 API calls per single user / 15 minutes.
- Up to 200 tweets collected per call, up to 3200 per timeline.
Streaming
- Twitter Object: Tweets
- Input: List of usernames (user_id or handle)
- Allowed up to 5,000 Twitter userids
- Only captures public tweets
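
The REST limits above translate directly into a collection budget. The following sketch, using only the numbers quoted on this slide, estimates how many calls one timeline needs and how many 15-minute windows a batch of users takes; the function names are illustrative.

```python
import math

TWEETS_PER_CALL = 200     # maximum tweets returned by one timeline call
TIMELINE_CAP = 3200       # most recent tweets retrievable per timeline
CALLS_PER_WINDOW = 180    # calls allowed per 15-minute window


def calls_for_timeline(n_tweets):
    """Calls needed to page through a timeline holding n_tweets tweets."""
    reachable = min(n_tweets, TIMELINE_CAP)
    return max(1, math.ceil(reachable / TWEETS_PER_CALL))


def windows_needed(tweet_counts):
    """15-minute windows needed to collect every timeline in tweet_counts."""
    total_calls = sum(calls_for_timeline(n) for n in tweet_counts)
    return math.ceil(total_calls / CALLS_PER_WINDOW)


print(TWEETS_PER_CALL * CALLS_PER_WINDOW)   # tweets per window at best
print(calls_for_timeline(3200))             # calls for one full timeline
print(windows_needed([5000] * 100))         # 100 prolific users
```

A full timeline costs 16 calls, so one window covers at most 11 complete timelines from a single authenticated user.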

Search results on Twitter (REST, Streaming)

REST
- Twitter Object: Tweets
- Input: Queries
- 180 API calls per single user / 15 minutes
- Tweets from previous 10 days
Streaming (Needs Updating)
- Twitter Object: Tweets
- Input: Queries
- 15 API calls per single user / 15 minutes
- Tweets from previous 10 days

Location of Tweets

~1% of tweets have geolocation data
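
A quick way to check this share in your own collection is sketched below. It assumes the v1.1 layout in which a geotagged tweet carries a GeoJSON point under coordinates, with longitude listed before latitude; the helper names and the sample data are made up.

```python
def tweet_lon_lat(tweet):
    """Return (longitude, latitude) for a geotagged tweet, else None."""
    point = tweet.get("coordinates")
    if not point:
        return None
    lon, lat = point["coordinates"]   # GeoJSON order: longitude first
    return lon, lat


def share_geotagged(tweets):
    """Fraction of tweets that carry an exact coordinate."""
    if not tweets:
        return 0.0
    tagged = sum(1 for t in tweets if tweet_lon_lat(t) is not None)
    return tagged / len(tweets)


# 99 tweets without a location and one with a made-up point.
tweets = [{"coordinates": None}] * 99
tweets.append({"coordinates": {"type": "Point", "coordinates": [-78.87, 35.90]}})

print(share_geotagged(tweets))
```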

Connecting to Data

First, need authentication to connect to Twitter (OAuth)
Next, need a general purpose programming language to talk with both the API and your database
- Ex: Python, PHP, JavaScript/Node, Ruby
Many libraries are available to make connecting to the API easier
- Ex: Twython
Likewise, you’ll need a library to connect to your database
- Ex: PyMongo, to integrate with MongoDB

First, need authentication to connect to Twitter (OAuth)
- Next, need a general purpose programming language to talk with both the API and your database
  - I prefer Python because of the ease of readability, and because one of its data structures (dictionaries) closely mimics the storage format of Twitter data
- Many libraries are available to make connecting to the API easier https://dev.twitter.com/overview/api/twitter-libraries
  - Support in multiple languages available
  - Ex: Twython
- Likewise, you’ll need a library to connect to your database
  - MongoDB has great support for Python through its supported package, PyMongo

Twitter Data Storage

Data stored as JSON (JavaScript Object Notation)

Key-Value pair
Allows for nesting of fields and is flexible

{u'_id': ObjectId('53d11ddd28975720fa77c8aa'),
 u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Thu Jul 24 14:10:49 +0000 2014',
 u'favorite_count': 2,
 u'favorited': False,
 u'geo': None,
 u'id': 492310941657481216L,
 u'id_str': u'492310941657481216',	
 u'lang': u'en',
 u'place': None,
 u'retweet_count': 0,
 u'retweeted': False,
 u'text': u'i need to doze off before i doze off \n\U0001f634\U0001f634',
 u'truncated': False}

Based on tweet data structure and volume, NoSQL databases are a great storage solution
Due to the learning curve, other output types might be more practical (ex: csv files)
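
For the CSV route, a rough sketch using only the Python standard library is shown below. It parses a JSON tweet, picks a handful of fields (including one nested field), and writes a single CSV row. The JSON string reuses values from the sample tweet above, except for the user field, which is made up to show nesting; the function name flatten and the column choice are illustrative.

```python
import csv
import io
import json
from datetime import datetime

raw = (
    '{"id_str": "492310941657481216", '
    '"created_at": "Thu Jul 24 14:10:49 +0000 2014", '
    '"lang": "en", "retweet_count": 0, "favorite_count": 2, '
    '"place": null, "user": {"screen_name": "example_user"}}'
)


def flatten(tweet):
    """Reduce a nested tweet dict to one flat row of selected fields."""
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "id": tweet["id_str"],   # keep ids as strings so no digits are lost
        "created_at": created.isoformat(),
        "lang": tweet["lang"],
        "retweet_count": tweet["retweet_count"],
        "favorite_count": tweet["favorite_count"],
        "screen_name": tweet.get("user", {}).get("screen_name", ""),
    }


row = flatten(json.loads(raw))

buffer = io.StringIO()   # stands in for an open CSV file
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Flattening is lossy: every nested field you want to keep has to be named ahead of time, which is the trade-off against storing the raw JSON.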

NoSQL example: MongoDB

Document-Oriented Storage
Index Support
Straightforward Queries
Speed

Document-Oriented Storage
- MongoDB stores its data in JSON-style objects. This makes it very easy to store raw documents from Twitter’s APIs.
Index Support
- MongoDB allows for indexes on any field, which makes it easy to create indexes optimized for your application.
Straightforward Queries
- MongoDB’s queries, while syntactically much different from SQL, are semantically very similar. In addition, MongoDB supports MapReduce, which allows for easy lookups in the data.
Speed
- Figure below shows a comparison of query speed between the relational model and MongoDB.
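
To make the index and query points concrete, here is a hedged sketch written against a PyMongo-style collection object. The calls used (create_index, find, sort, limit) exist in PyMongo, but the function name, the choice of indexed fields, and the query itself are assumptions chosen for illustration.

```python
def index_and_query(collection, min_retweets=10):
    """Index a tweet collection, then fetch popular English tweets.

    Roughly the SQL:
        SELECT * FROM tweets
        WHERE lang = 'en' AND retweet_count >= 10
        ORDER BY retweet_count DESC LIMIT 5
    """
    # A unique index on the tweet id guards against storing duplicates.
    collection.create_index("id_str", unique=True)
    # Dotted paths reach into nested documents such as the embedded user.
    collection.create_index([("user.screen_name", 1), ("retweet_count", -1)])

    query = {"lang": "en", "retweet_count": {"$gte": min_retweets}}
    cursor = collection.find(query).sort("retweet_count", -1).limit(5)
    return list(cursor)


# Assumed wiring (needs the pymongo package and a running MongoDB server):
#   from pymongo import MongoClient
#   tweets = MongoClient()["twitter"]["tweets"]
#   for doc in index_and_query(tweets):
#       print(doc["text"])
```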

Analysis

Sentiment Analysis
Probabilistic Topic Modeling
Ideological Scaling
Text Clustering
Classification
Entity relation modeling (i.e., learning relations between named entities)
Social Network Analysis
http://stanford.edu/~jgrimmer/tad2.pdf

Complications!

Rate limits (15 minute window)
Public vs. Private
Non-trivial Data Preparation
Error codes and Responses

Rate limits (window is 15 minutes)
- Depending on analysis, either need multiple machines or be willing to wait
Public vs. Private
- Can get profile information on almost any user; however, tweets are restricted if a user decides to be private
Non-trivial Data Preparation
- Tokenization, lemmas / stemming, case folding, POS tagging, Twitter entities (RT, handles, mentions, hyperlinks), emoticons, slang, sparse data
Error codes and Responses
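
A small part of the data preparation step can be sketched with regular expressions, as below. The snippet strips the retweet marker, hyperlinks, and handles, folds case, and tokenizes what is left, keeping hashtags as tokens. The patterns are deliberately simple assumptions and will miss edge cases (emoticons and slang are not handled); the function name prepare and the sample tweet are invented.

```python
import re

RT_RE = re.compile(r"^RT\s+", re.IGNORECASE)   # leading retweet marker
URL_RE = re.compile(r"https?://\S+")           # hyperlinks
MENTION_RE = re.compile(r"@\w+")               # handles / mentions
TOKEN_RE = re.compile(r"#?\w+(?:'\w+)?")       # words, hashtags, contractions


def prepare(text):
    """Return lowercase tokens with RT markers, links, and handles removed."""
    text = RT_RE.sub("", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return TOKEN_RE.findall(text.lower())


tokens = prepare("RT @example_user: Can't wait for #Surveys! http://t.co/abc")
print(tokens)
```

Stemming, lemmatization, and POS tagging would follow this step and typically rely on a dedicated NLP library.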

Conclusion

Although collecting, storing, and analyzing Twitter data can be complicated, surveys are also complicated and labor intensive
Twitter data is not perfect - for instance, Twitter may decide to pull the plug on the public API, leaving everybody in the lurch.
Nonetheless, with the continuing trend of falling response rates and inadequate sampling frames, getting cheap and timely data without burdening respondents is an enticing proposition.

Questions? Feel free to reach out to rchew@rti.org

SAPOR_2014_Twitter
