Twitter Data for Research – Issues of Data Retrieval, Management, and Processing – Twitter Data Access





SAPOR_2014_Twitter

"Twitter for Researchers" presentation for SAPOR 2014

On Github rchew / SAPOR_2014_Twitter

Using Twitter Data for Research

Issues of Data Retrieval, Management, and Processing

Rob Chew , Annice Kim , Paul Ruddle , Clay Heaton

RTI International

The Social Media Landscape

Opinion has never been this cheap, timely, or abundant

  • Harder to acquire data from traditional survey methods
    • falling response rates
    • inadequate sample frames

Huge opportunity for survey researchers to harness insight with social media

  • Huge opportunity for social scientists / survey researchers to understand what people are thinking, discover how information spreads throughout social networks, and track the evolution of important topics / events over time.
  • With the continuing trend of falling response rates and inadequate sampling frames, the potential for harnessing insight that’s cheap and timely without burdening respondents is an enticing proposition.

Twitter Data Access

Three main ways to get Twitter data:

Social Media Monitoring

(Radian6, Crimson Hexagon, etc.)

  • Ideal for tracking brands/issues in real-time, used by large companies and PR firms to address press in a timely fashion.
  • Has built-in tools to help handle volume, workflow, and reporting.
  • Offers in-house technical support for questions.
  • Often expensive subscription model but covers other social media outlets as well
  • As with most “out of the box” tools, analyses limited by the scope of the tools made available
  • Methodology not always transparent
  • Usually a limited date range (3 months)
  • May or may not have “Firehose” access

Twitter Authorized Resellers

(GNIP, Datasift)

  • Allows access to the Twitter “Firehose”, the totality of tweets on Twitter
  • More of a data vendor (DaaS) than a social media platform
  • Have access to historic data
  • Offers in-house technical support for questions.
  • Often expensive
  • Often requires infrastructure to analyze the information yourself

Twitter API

  • Application Programming Interface
  • An interface that allows you to directly connect and interact with an application (in our case, Twitter), instead of needing to read a webpage like a human does.
  • Can be thought of as a “menu” that shows the different data the application can return.
  • Allows access to Twitter data directly
  • Provides access to both historic data (REST API) and a real-time sample of about 1% of tweets (Streaming API)
  • Free!
  • API limits on the number of tweets/user profiles you are allowed to access at one time.
  • Requires some technical/programming knowledge
  • No support (other than the internet and Twitter documentation)
  • Explorer Tool

Twitter API: REST vs. Streaming

  • REST API - To get information, user must specifically request it
    • Well over 50 different REST API "Resources"
  • Streaming API - Once request is made, provides continuous stream of updates without further input from user (up to 1% of the full Twitter stream)
    • Has "Public", "User", and "Site" streams
  • REST (Representational State Transfer)
    • “Pull” strategy for data retrieval
    • To get information, user must specifically request it
    • GET / POST
  • Streaming
    • “Push” strategy for data retrieval
    • Once request made, provides continuous stream of updates w/o further input from user
    • Three end point types:
      • Public - public tweets on twitter
      • User - single user stream
      • Site - multi-user streams intended for apps which access Tweets from multiple users
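To make the “push” side concrete, here is a minimal sketch of reading a stream once the connection is open. It assumes the stream arrives as one JSON message per line, with blank lines sent as keep-alives and occasional non-tweet messages (such as limit notices) mixed in. The function name `iter_tweets` and the sample lines are made up for illustration; a real client library would handle the connection itself.

```python
import json


def iter_tweets(lines):
    """Yield tweet dicts from an iterable of raw stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank lines are keep-alive signals, not data.
            continue
        message = json.loads(line)
        # Non-tweet messages (e.g. delete or limit notices) have no "text" field.
        if "text" in message:
            yield message


# Simulated stream content (invented for this example).
sample_lines = [
    '{"id_str": "1", "text": "first tweet"}\r\n',
    '\r\n',
    '{"limit": {"track": 12}}\r\n',
    '{"id_str": "2", "text": "second tweet"}\r\n',
]
tweets = list(iter_tweets(sample_lines))
```

Because the function is a generator, tweets can be processed or stored one at a time as they arrive, which matters when the stream never ends.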

Twitter Data

  • Users
    • Users can be anyone or anything. They tweet, follow, have a timeline, can be mentioned, and can be looked up in bulk.
    • Attributes: User name, twitter handle, location, profile description, follower/friends/tweet counts, verified, profile creation date
  • Tweets
    • Tweets are the basic atomic building block of all things Twitter. They are also known more generically as “status updates.”
  • Entities
    • Entities provide metadata and additional contextual information about content posted on Twitter. Entities are never divorced from the content they describe. In API v1.1, entities are returned wherever tweets are found in the API.
    • Hashtags, URLs, media, user mentions
  • Place
    • Places are specific, named locations with corresponding geo coordinates. They can be attached to Tweets by specifying a place_id when tweeting. Tweets associated with places are not necessarily issued from that location but could also potentially be about that location. Places can be searched for.
  • Many API calls available depending on what you’re interested in (how many?)
  • https://dev.twitter.com/overview/documentation
  • Most common types of info accessible from the API:
    • Information about a user (REST)
    • A user’s network consisting of his connections (REST)
    • Tweets published by a user (REST, Streaming)
    • Search results on Twitter (REST, Streaming)
    • Location of Tweets
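As a rough illustration of how these objects nest, the sketch below pulls hashtags, mentions, and URLs out of a tweet's entities. It assumes the API v1.1 layout, where each tweet carries an `entities` dictionary with `hashtags`, `user_mentions`, and `urls` lists. The helper name `extract_entities` and the sample tweet are invented for the example.

```python
def extract_entities(tweet):
    """Return the hashtags, mentioned handles, and URLs found in one tweet."""
    entities = tweet.get("entities", {})
    return {
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
    }


# A trimmed, made-up tweet with only the fields used here.
sample_tweet = {
    "text": "Talking #SAPOR with @rchew http://t.co/xyz",
    "entities": {
        "hashtags": [{"text": "SAPOR", "indices": [8, 14]}],
        "user_mentions": [{"screen_name": "rchew", "indices": [20, 26]}],
        "urls": [{"url": "http://t.co/xyz", "expanded_url": "http://example.com/talk"}],
    },
}
found = extract_entities(sample_tweet)
```

Working from the entities rather than re-parsing the text avoids having to write your own hashtag and URL matching.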

Information about a user (REST)

  • Twitter Object: User
  • Input: List of usernames (user_id or handle)
  • 180 API calls per single user / 15 minutes
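Since users can be looked up in bulk, a long list of handles is usually split into batches before calling the API. The sketch below assumes up to 100 users per lookup call (the documented limit for the v1.1 bulk lookup endpoint) together with the 180 calls per 15 minutes listed above. The helper names are hypothetical.

```python
import math


def batch(items, size=100):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def windows_needed(n_users, per_call=100, calls_per_window=180):
    """Estimate how many 15-minute windows a bulk lookup would take."""
    calls = math.ceil(n_users / per_call)
    return math.ceil(calls / calls_per_window)


handles = ["user%d" % i for i in range(250)]  # placeholder handles
batches = batch(handles)
```

Under these assumptions a single window covers up to 18,000 profiles, so user lookups are rarely the bottleneck in a collection effort.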

A user’s network consisting of his connections (REST)

  • Followers
    • Twitter Object: User
    • Input: List of usernames (user_id or handle)
    • 15 API calls per single user / 15 minutes
  • Friends
    • Twitter Object: User
    • Input: List of usernames (user_id or handle)
    • 15 API calls per single user / 15 minutes
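With only 15 calls per 15-minute window, network collection code spends most of its time waiting. One simple way to stay inside the limit is to count calls on the client side and pause when the window is used up. The class below is a simplified sketch of that idea, not how any particular library does it. The name `WindowRateLimiter` is made up, and a more careful client would also read the rate limit information returned with each API response.

```python
import time


class WindowRateLimiter:
    """Allow at most `max_calls` per fixed window, sleeping when the limit is hit."""

    def __init__(self, max_calls, window_seconds=15 * 60,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock    # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.window_start = None
        self.calls = 0

    def wait_for_slot(self):
        """Call this right before each API request."""
        now = self.clock()
        # Start a fresh window on first use or once the old one has expired.
        if self.window_start is None or now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.calls = 0
        if self.calls >= self.max_calls:
            # Window exhausted: wait out the remainder, then start a new one.
            self.sleep(self.window_seconds - (now - self.window_start))
            self.window_start = self.clock()
            self.calls = 0
        self.calls += 1
```

A collection loop would create `WindowRateLimiter(15)` once and call `wait_for_slot()` before every followers or friends request.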

Tweets published by a user (REST, Streaming)

  • REST
    • Twitter Object: Tweets
    • Input: List of usernames (user_id or handle)
    • 180 API calls per single user / 15 minutes.
    • Up to 200 tweets collected per call, up to 3200 per timeline.
  • Streaming
    • Twitter Object: Tweets
    • Input: List of usernames (user_id or handle)
    • Up to 5,000 Twitter user IDs allowed
    • Only captures public tweets

Search results on Twitter (REST, Streaming)

  • REST
    • Twitter Object: Tweets
    • Input: Queries
    • 180 API calls per single user / 15 minutes
    • Tweets from roughly the previous 7 days only
  • Streaming
    • Twitter Object: Tweets
    • Input: Queries (keywords to track)
    • Up to 400 track keywords per connection
    • Only tweets posted after the connection is opened; no historic results

Location of Tweets

  • ~1% of tweets have geolocation data
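For the small share of tweets that do carry location, the information can sit in one of two fields. The sketch below assumes the v1.1 layout: an exact point in `coordinates` (GeoJSON order, longitude first) or, failing that, a `place` with a bounding box. Averaging the box corners is a crude stand-in for a location and is only an illustration. The function name is made up.

```python
def tweet_lon_lat(tweet):
    """Return (longitude, latitude) for a tweet, or None if it has no location."""
    coords = tweet.get("coordinates")
    if coords and coords.get("type") == "Point":
        lon, lat = coords["coordinates"]  # GeoJSON order: longitude, then latitude
        return lon, lat
    place = tweet.get("place")
    if place and place.get("bounding_box"):
        # Fall back to the average of the place's bounding-box corners.
        ring = place["bounding_box"]["coordinates"][0]
        lon = sum(point[0] for point in ring) / len(ring)
        lat = sum(point[1] for point in ring) / len(ring)
        return lon, lat
    return None
```

Keep in mind that a place attached to a tweet may describe what the tweet is about rather than where it was sent from.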

Connecting to Data

  • First, you need to authenticate with Twitter (OAuth)
  • Next, you need a general-purpose programming language to talk with both the API and your database
    • Ex: Python, PHP, JavaScript/Node, Ruby
    • I prefer Python because it is easy to read, and because one of its data structures (the dictionary) closely mimics the storage format of Twitter data
  • Many libraries are available to ease connecting to the API: https://dev.twitter.com/overview/api/twitter-libraries
    • Support is available in multiple languages
    • Ex: Twython
  • Likewise, you’ll need a library to connect to your database
    • Ex: PyMongo, the supported Python package for MongoDB
Twitter Data Storage

  • Data stored as JSON (JavaScript Object Notation)
    • Key-Value pair
    • Allows for nesting of fields and is flexible
    {u'_id': ObjectId('53d11ddd28975720fa77c8aa'),
     u'contributors': None,
     u'coordinates': None,
     u'created_at': u'Thu Jul 24 14:10:49 +0000 2014',
     u'favorite_count': 2,
     u'favorited': False,
     u'geo': None,
     u'id': 492310941657481216L,
     u'id_str': u'492310941657481216',	
     u'lang': u'en',
     u'place': None,
     u'retweet_count': 0,
     u'retweeted': False,
     u'text': u'i need to doze off before i doze off \n\U0001f634\U0001f634',
     u'truncated': False}
  • Based on tweet data structure and volume, NoSQL databases are a great storage solution
  • Due to the learning curve, other output types might be more practical (ex: CSV files)
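For projects where a database is more than is needed, tweets can be flattened into a CSV file with the standard library alone. The sketch below keeps a handful of top-level fields and converts `created_at` to ISO format. The choice of fields and the function names are assumptions for the example, and nested fields such as entities would need their own handling.

```python
import csv
import io
from datetime import datetime

FIELDS = ["id_str", "created_at", "lang", "retweet_count", "favorite_count", "text"]


def tweet_to_row(tweet):
    """Keep a few top-level fields and convert the timestamp to ISO 8601."""
    row = {field: tweet.get(field) for field in FIELDS}
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    row["created_at"] = created.isoformat()
    return row


def tweets_to_csv(tweets):
    """Return CSV text with a header line and one row per tweet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(tweet_to_row(tweet))
    return buffer.getvalue()


# The tweet shown above, trimmed to the fields used here.
sample = {
    "created_at": "Thu Jul 24 14:10:49 +0000 2014",
    "favorite_count": 2,
    "id_str": "492310941657481216",
    "lang": "en",
    "retweet_count": 0,
    "text": "i need to doze off before i doze off \n\U0001f634\U0001f634",
}
csv_text = tweets_to_csv([sample])
```

Using `id_str` rather than the numeric `id` avoids spreadsheet software rounding the 18-digit identifier.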

NoSQL example: MongoDB

  • Document-Oriented Storage
    • MongoDB stores its data in JSON-style objects. This makes it very easy to store raw documents from Twitter’s APIs.
  • Index Support
    • MongoDB allows for indexes on any field, which makes it easy to create indexes optimized for your application.
  • Straightforward Queries
    • MongoDB’s queries, while syntactically much different from SQL, are semantically very similar. In addition, MongoDB supports MapReduce, which allows for easy lookups in the data.
  • Speed
    • Figure below shows a comparison of query speed between the relational model and MongoDB.
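To give a feel for the index and query points, here is a small sketch written against a PyMongo-style collection. The field choices and function names are assumptions for illustration. The comment shows the approximate SQL equivalent of the query document.

```python
def ensure_indexes(collection):
    """Create indexes that suit lookups by tweet ID and by language/popularity."""
    # 1 means ascending, -1 descending; unique=True rejects a tweet stored twice.
    collection.create_index([("id_str", 1)], unique=True)
    collection.create_index([("lang", 1), ("retweet_count", -1)])


def popular_english_tweets(collection, min_retweets=10):
    """Find English tweets with at least `min_retweets` retweets."""
    # Roughly: SELECT text, retweet_count FROM tweets
    #          WHERE lang = 'en' AND retweet_count >= 10
    query = {"lang": "en", "retweet_count": {"$gte": min_retweets}}
    projection = {"text": 1, "retweet_count": 1, "_id": 0}
    return list(collection.find(query, projection))
```

Nested fields can be queried the same way with dotted names, such as `"user.screen_name"`, which is where the document model pays off for tweet data.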

Analysis

  • Sentiment Analysis
  • Probabilistic Topic Modeling
  • Ideological Scaling
  • Text Clustering
  • Classification
  • Entity relation modeling (i.e., learning relations between named entities)
  • Social Network Analysis
  • http://stanford.edu/~jgrimmer/tad2.pdf

Complications!

  • Rate limits (window is 15 minutes)
    • Depending on analysis, either need multiple machines or be willing to wait
  • Public vs. Private
    • Can get profile information on almost any user; however, tweets are restricted if a user decides to be private
  • Non-trivial Data Preparation
    • Tokenization, lemmas / stemming, case folding, POS tagging, Twitter entities (RT, handles, mentions, hyperlinks), emoticons, slang, sparse data
  • Error codes and Responses

Conclusion

  • Although collecting, storing, and analyzing Twitter data can be complicated, surveys are also complicated and labor intensive
  • Twitter data is not perfect - for instance, Twitter may decide to pull the plug on the public API, leaving everybody in the lurch.
  • Nonetheless, with the continuing trend of falling response rates and inadequate sampling frames, getting cheap and timely data without burdening respondents is an enticing proposition.

Questions? Feel free to reach out to rchew@rti.org