
Wednesday, October 2, 2013

Twitter archives for research

Twitter is a source of massive, metadata-rich, multilingual text feeds. Researchers seeking a large set of tweets for analysis will eventually be able to access the Library of Congress's archive, and commercial providers also maintain tweet databases. However, Twitter's sample streaming API is free and straightforward to use.

First, you need a Twitter account. Second, create a new application at dev.twitter.com. Twitter will assign the application four codes: a consumer key, a consumer secret, an access token, and an access token secret. You will need all four for authentication.
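One way to keep these four codes out of a script is to read them from environment variables. The following is a minimal sketch in Python; the variable names and the `load_credentials` helper are hypothetical choices for illustration, not anything Twitter or tweepy defines.

```python
import os

# Hypothetical environment variable names; use whatever fits your setup.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(environ=os.environ):
    """Return the four codes as a dict, or raise if any is missing or empty."""
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("missing credentials: " + ", ".join(missing))
    return {name: environ[name] for name in CREDENTIAL_VARS}
```

Failing early with the names of the missing variables is usually easier to diagnose than an authentication error returned later by the API.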

For my sampling application, I used Python and a library called tweepy. Tweepy's streaming example is almost exactly what I needed; to receive the broadest possible content, I modified the script to use the unfiltered sample stream:

 #!/usr/bin/env python  
 # Stream tweets -- https://github.com/joshthecoder/tweepy/blob/master/examples/streaming.py  
   
 from tweepy.streaming import StreamListener  
 from tweepy import OAuthHandler  
 from tweepy import Stream  
   
 # Go to http://dev.twitter.com and create an app.  
 # The consumer key and secret will be generated for you after  
 consumer_key="xxxxx"  
 consumer_secret="xxxxx"  
   
 # After the step above, you will be redirected to your app's page.  
 # Create an access token under the "Your access token" section
 access_token="xxxxx"  
 access_token_secret="xxxxx"  
   
 class StdOutListener(StreamListener):  
   """ A listener handles tweets as they are received from the stream.
   This is a basic listener that just prints received tweets to stdout.  
   
   """  
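   def on_data(self, data):
     # Each message arrives as a JSON-encoded string; write it out as-is.
     print(data)
     return True

   def on_error(self, status):
     # Report the status code passed in by the stream.
     print(status)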
   
 if __name__ == '__main__':  
   l = StdOutListener()  
   auth = OAuthHandler(consumer_key, consumer_secret)  
   auth.set_access_token(access_token, access_token_secret)  
   
   stream = Stream(auth, l)  
   stream.sample()  
   
Run the program with:
$ python streamtweets.py > tweets.json

JSON-encoded tweets will be written to the file. Stop the script (for example with Ctrl-C) once you have collected enough.
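To work with the collected file afterwards, a small reader along these lines may help. It is a sketch under a few assumptions: that each message sits on its own line, that a tweet can be recognized by a "text" field, and that anything else (blank lines, error codes printed by on_error, delete notices, a record cut short when the script was stopped) should be skipped. The `iter_tweets` name is hypothetical.

```python
import json


def iter_tweets(lines):
    """Yield parsed tweets from an iterable of lines, one JSON message per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            message = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON, e.g. a truncated record
        # Assumption: only messages with a "text" field are tweets.
        if isinstance(message, dict) and "text" in message:
            yield message


# Small in-memory demonstration; real input would be the lines of tweets.json.
sample = [
    '{"text": "hello", "lang": "en"}',
    "",
    '{"delete": {"status": {"id": 1}}}',
    "420",
    '{"text": "trunc',
]
print([tweet["text"] for tweet in iter_tweets(sample)])
```

Because `iter_tweets` accepts any iterable of lines, an open file object can be passed to it directly, for example `with open("tweets.json") as f: tweets = list(iter_tweets(f))`.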
