Data mining on Twitter
Wei Wang
2015-03-25
Whereas the most popular social websites, such as Facebook and LinkedIn, require the mutual acceptance of a connection between users (which usually implies a real-world connection of some kind), Twitter’s relationship model allows you to keep up with the latest happenings of any other user, even if that user does not follow you back. Twitter’s following model is simple but asymmetric, and it is this asymmetry that casts Twitter as more of an interest graph than a social network.
Think of an interest graph as a way of modeling connections between people and their arbitrary interests. Interest graphs provide a profound number of possibilities in the data mining realm that primarily involve measuring correlations between things for the objective of making intelligent recommendations and other applications in machine learning.
For example, you could use an interest graph to measure correlations and make recommendations ranging from whom to follow on Twitter to what to purchase online to whom you should date. To illustrate the notion of Twitter as an interest graph, consider that a Twitter user need not be a real person; it very well could be a person, but it could also be just about anything else.
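As a minimal sketch of the asymmetric following model, the snippet below stores "who follows whom" as a dictionary of sets. The account names and the helper functions `followers_of` and `mutual_follows` are made up for illustration and are not part of any Twitter API.

```python
# Hypothetical data: each key follows every account in its set.
following = {
    "alice": {"bob", "nasa"},
    "bob": {"alice"},
    "nasa": set(),
}


def followers_of(account, following):
    """Return the accounts that follow `account`."""
    return {user for user, targets in following.items() if account in targets}


def mutual_follows(following):
    """Return the pairs of accounts that follow each other."""
    pairs = set()
    for user, targets in following.items():
        for target in targets:
            if user in following.get(target, set()):
                pairs.add(frozenset((user, target)))
    return pairs


print(sorted(followers_of("nasa", following)))               # ['alice']
print([sorted(pair) for pair in mutual_follows(following)])  # [['alice', 'bob']]
```

Here alice follows nasa without nasa following back, which is the kind of one-way edge that a mutual-acceptance network would not allow.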
Data of interest
The public firehose of all tweets has been known to peak at hundreds of thousands of tweets per minute during events with particularly wide interest, such as presidential debates. Twitter’s public firehose emits far too much data to consider for the scope of this book and presents interesting engineering challenges, which is at least one of the reasons that various third-party commercial vendors have partnered with Twitter to bring the firehose to the masses in a more consumable fashion. That said, a small random sample of the public timeline is available that provides filterable access to enough public data for API developers to develop powerful applications.
Creating a Twitter API connection
Before you can make any API requests to Twitter, you’ll need to create an application at https://dev.twitter.com/apps. Creating an application is the standard way for developers to gain API access and for Twitter to monitor and interact with third-party platform developers as needed. The process for creating an application is straightforward, and read-only access to the API is all that you need.
For simplicity of development, the key pieces of information that you’ll need to take away from your newly created application’s settings are its
- consumer key
- consumer secret
- access token
- access token secret
These four OAuth fields are what you’ll use to make calls to Twitter’s API. OAuth is a means of allowing users to authorize third-party applications to access their account data without needing to share sensitive information such as a password.
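One possible way to keep these four values out of your source code is to read them from environment variables. The sketch below is only an illustration: the variable names and the `load_credentials` helper are assumptions, not something Twitter or the twitter package prescribes.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_OAUTH_TOKEN",
    "TWITTER_OAUTH_TOKEN_SECRET",
)


def load_credentials(environ=os.environ):
    """Return the four OAuth values as a dict, or raise if any is missing or empty."""
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: environ[name] for name in CREDENTIAL_VARS}


# Demonstration with a stand-in mapping instead of the real environment.
fake_environ = {name: "placeholder-" + str(i) for i, name in enumerate(CREDENTIAL_VARS)}
credentials = load_credentials(fake_environ)
print(sorted(credentials))
```

Failing early when a value is missing tends to be easier to debug than an authentication error returned by the API later on.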
Finding out what people are talking about
Let’s start by inspecting the trends available to us through the GET trends/place resource. While you’re at it, go ahead and bookmark the official API documentation as well as the REST API v1.1 resources, because you’ll be referencing them regularly as you learn the ropes of the developer-facing side of the Twitterverse.
Install twitter package in python
Suppose we have created a folder named ipython. Open a terminal, go to that directory, and install the package:
sudo pip install twitter
Launch the ipython notebook
Staying in the terminal, launch the notebook web console:
ipython notebook --pylab inline
Fire up Python
Let’s give our credentials to Twitter and set up a connection for making requests:
import twitter
# XXX: Go to http://dev.twitter.com/apps/new to create an app and get values
# for these credentials, which you'll need to provide in place of these
# empty string values that are defined as placeholders.
# See https://dev.twitter.com/docs/auth/oauth for more information
# on Twitter's OAuth implementation.
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
OAUTH_TOKEN = ''
OAUTH_TOKEN_SECRET = ''
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)
# Nothing to see by displaying twitter_api except that it's now a
# defined variable
print twitter_api
The results of this example should simply display an unambiguous representation of the twitter_api object that we’ve constructed, such as:
<twitter.api.Twitter object at 0x111270c50>
Exploring Trending Topics
The IPython Notebook code is given below.
# The Yahoo! Where On Earth ID for the entire world is 1.
# See https://dev.twitter.com/docs/api/1.1/get/trends/place and
# http://developer.yahoo.com/geo/geoplanet/
WORLD_WOE_ID = 1
US_WOE_ID = 23424977
# Prefix ID with the underscore for query string parameterization.
# Without the underscore, the twitter package appends the ID value
# to the URL itself as a special case keyword argument.
world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
us_trends = twitter_api.trends.place(_id=US_WOE_ID)
print world_trends
print
print us_trends
You should see a semireadable response that is a list of Python dictionaries from the API (as opposed to any kind of error message), such as the following truncated results, before proceeding further.
[{u'created_at': u'2013-03-27T11:50:40Z', u'trends': [{u'url': u'http://twitter.com/search?q=%23MentionSomeoneImportantForYou'...
Notice that the sample result contains a URL for a trend represented as a search query that corresponds to the hashtag #MentionSomeoneImportantForYou, where %23 is the URL encoding for the hashtag symbol (#). We’ll use this rather benign hashtag throughout the remainder of the chapter as a unifying theme for examples that follow.
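To see the relationship between # and %23 for yourself, here is a small sketch that uses the standard library’s `quote` and `unquote` functions. The import is written so that it should work on both Python 2 and Python 3.

```python
try:
    from urllib.parse import quote, unquote  # Python 3
except ImportError:
    from urllib import quote, unquote  # Python 2

hashtag = "#MentionSomeoneImportantForYou"
encoded = quote(hashtag)    # percent-encode characters that are unsafe in URLs
decoded = unquote(encoded)  # reverse the encoding

print(encoded)  # %23MentionSomeoneImportantForYou
print(decoded)  # #MentionSomeoneImportantForYou
```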
The pattern for using the twitter module
The pattern is simple and predictable:
- instantiate the Twitter class with an object chain corresponding to a base URL
- invoke methods on the object that correspond to URL contexts.
For example, twitter_api.trends.place(_id=WORLD_WOE_ID) initiates an HTTP call to GET https://api.twitter.com/1.1/trends/place.json?id=1.
“Note the URL mapping to the object chain that’s constructed with the twitter package to make the request and how query string parameters are passed in as keyword arguments. To use the twitter package for arbitrary API requests, you generally construct the request in that kind of straightforward manner, with just a couple of minor caveats that we’ll encounter soon enough.”
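To make the mapping from an object chain to a URL more concrete, here is a toy sketch of the idea. The `UrlChain` class is hypothetical and is not how the twitter package is actually implemented. It only builds the URL string and never sends a request, and its handling of `_id` merely imitates the underscore convention described in the comments of the trends example.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


class UrlChain(object):
    """Hypothetical helper: each attribute access appends one URL path segment."""

    def __init__(self, base, parts=()):
        self._base = base
        self._parts = tuple(parts)

    def __getattr__(self, name):
        # Only reached for attributes that are not found normally.
        if name.startswith("_"):
            raise AttributeError(name)
        return UrlChain(self._base, self._parts + (name,))

    def __call__(self, **params):
        # Assumption: treat `_id` as the query string parameter `id`.
        params = dict(params)
        if "_id" in params:
            params["id"] = params.pop("_id")
        url = self._base + "/" + "/".join(self._parts) + ".json"
        if params:
            url += "?" + urlencode(sorted(params.items()))
        return url


chain = UrlChain("https://api.twitter.com/1.1")
print(chain.trends.place(_id=1))
# https://api.twitter.com/1.1/trends/place.json?id=1
```

Each attribute access returns a new object, so the same `chain` can be reused for other resources without carrying over earlier path segments.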
Reformat the response to be more easily readable
Displaying API responses as pretty-printed JSON
“JSON is a data exchange format that you will encounter on a regular basis. In a nutshell, JSON provides a way to arbitrarily store maps, lists, primitives such as numbers and strings, and combinations thereof. In other words, you can theoretically model just about anything with JSON should you desire to do so.”
import json
print json.dumps(world_trends, indent=1)
print
print json.dumps(us_trends, indent=1)
An abbreviated sample response from the Trends API produced with json.dumps would look like the following:
[
 {
  "created_at": "2013-03-27T11:50:40Z",
  "trends": [
   {
    "url": "http://twitter.com/search?q=%23MentionSomeoneImportantForYou",
    "query": "%23MentionSomeoneImportantForYou",
    "name": "#MentionSomeoneImportantForYou",
    "promoted_content": null,
    "events": null
   },
   ...
  ]
 }
]
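If you want to experiment with JSON without calling the API, the following sketch round-trips a single trend record through `json.dumps` and `json.loads`. The values are copied from the abbreviated sample above. Note how Python’s None is written as null in the JSON text.

```python
import json

# One record shaped like an entry of the abbreviated sample response.
trend = {
    "name": "#MentionSomeoneImportantForYou",
    "query": "%23MentionSomeoneImportantForYou",
    "promoted_content": None,
    "events": None,
}

text = json.dumps(trend, indent=1, sort_keys=True)  # Python object -> JSON text
restored = json.loads(text)                         # JSON text -> Python object

print(text)
print(restored == trend)  # True
```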
To find out which trends the worldwide and US results have in common, let’s use Python’s set data structure to compute this for us automatically. In this instance, a set refers to the mathematical notion of a data structure that stores an unordered collection of unique items and can be computed upon with other sets of items and setwise operations.
Computing the intersection of two sets of trends
The following example shows how to use a Python list comprehension to parse out the names of the trending topics from the results that were previously queried, cast those lists to sets, and compute the setwise intersection to reveal the common items between them. Keep in mind that there may or may not be significant overlap between any given sets of trends, all depending on what’s actually happening when you query for the trends. In other words, the results of your analysis will be entirely dependent upon your query and the data that is returned from it.
world_trends_set = set([trend['name']
                        for trend in world_trends[0]['trends']])
us_trends_set = set([trend['name']
                     for trend in us_trends[0]['trends']])
common_trends = world_trends_set.intersection(us_trends_set)
print common_trends
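If you would like to try the set operations without live credentials, here is a sketch that runs on stand-in data. The two sample responses and their trend names are made up and only mimic the shape of a trends/place response. The `trend_names` helper is likewise just for illustration.

```python
# Hypothetical stand-ins for the API responses; real trend names will differ.
sample_world = [{"trends": [{"name": "#TopicA"}, {"name": "#TopicB"}, {"name": "#TopicC"}]}]
sample_us = [{"trends": [{"name": "#TopicB"}, {"name": "#TopicC"}, {"name": "#TopicD"}]}]


def trend_names(response):
    """Collect the trend names from a trends/place-style response into a set."""
    return set([trend["name"] for trend in response[0]["trends"]])


world_names = trend_names(sample_world)
us_names = trend_names(sample_us)

print(sorted(world_names & us_names))  # in both: ['#TopicB', '#TopicC']
print(sorted(world_names - us_names))  # only worldwide: ['#TopicA']
print(sorted(world_names | us_names))  # in either: all four topics
```

The `&` operator is equivalent to calling `intersection`, and the difference and union operators show how other setwise comparisons follow the same pattern.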