Sunday 20 July 2014

Third Step: Popularity Comparison of Two Celebrities in Twitter Using R

Hey everyone...
It's time we appreciated the power of R.
In this post we will fetch the tweets about two famous personalities, teams, etc. on Twitter and analyze which of the two is more popular, using a common basis of comparison.

Oh, one thing: I hope you have already done the handshake with Twitter using your credentials. If not, please refer to my post "Second Step" to do that.

OK let's proceed now...

********************* Code In R to Accomplish The Mission ********************
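#Load the twitteR package (provides searchTwitter)
library(twitteR)

#Get Tweets for a searchTerm and return them as a data frame
TweetFrame<- function (searchTerm,maxTweets)
{
  #Fetch up to maxTweets tweets that match searchTerm
  twtlist<-searchTwitter(searchTerm,n=maxTweets,cainfo="cacert.pem")
  #Convert each tweet to a one-row data frame and stack the rows
  return(do.call("rbind",lapply(twtlist,as.data.frame)))
}                                                                         #End of Function TweetFrame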


#Load gplots for barplot2
library(gplots)

#Function to do a popularity check
popularityCheck<-function(name1,name2,count)
{
  name1DF<-TweetFrame(name1,count)
  name2DF<-TweetFrame(name2,count)

  #Sort each data frame by arrival time, earliest tweet first
  sortname1<-name1DF[order(as.integer(name1DF$created)),]
  sortname2<-name2DF[order(as.integer(name2DF$created)),]

  #Delays between consecutive tweets, always in seconds
  eventdelays1<-diff(as.integer(sortname1$created))
  eventdelays2<-diff(as.integer(sortname2$created))

  #The mean delay of name1 becomes the common ground of comparison
  meanof1<-mean(eventdelays1)
  threshold<-round(meanof1,1)

  #Count the delays of name1 within the threshold and get the 95% interval
  sumval1<-sum(eventdelays1<=threshold)
  res1<-poisson.test(sumval1,count)$conf.int

  #Same threshold for name2, so both counts are comparable
  sumval2<-sum(eventdelays2<=threshold)
  res2<-poisson.test(sumval2,count)$conf.int

  #Bar heights: share of tweets that arrived within the threshold
  p1<-as.single(sumval1/count)
  p2<-as.single(sumval2/count)

  #Lower and upper limits of the confidence intervals
  l1<-as.single(res1[1])
  l2<-as.single(res2[1])
  u1<-as.single(res1[2])
  u2<-as.single(res2[2])

  barplot2(c(p1,p2), ci.l=c(l1,l2), ci.u=c(u1,u2), plot.ci=TRUE,
           names.arg=c(name1,name2))
}                                                                           #End of Function popularityCheck


******************************** END**********************************

First, let's see what this code does, and then we will see how it does it...

Well, in the past few days people were so engrossed in the FIFA World Cup 2014 that there was a flood of posts and tweets. So why not run a popularity check on the FIFA teams?

Input

Team 1: Argentina (#argentina)
Team 2: Germany (#germany)
Number of Tweets extracted from Twitter: 500 (for each team)


>popularityCheck("#argentina","#germany",500)



Output



So, this is what we got. The plot shows that, by the measure used here (the share of tweets that arrive within a common time gap of the previous tweet), Argentina is more popular than Germany.

Now let us understand how we got this..

First, look at the TweetFrame(searchTerm,maxTweets) function. It takes a "searchTerm", say #germany, and a "maxTweets" value, say 500, as input and asks Twitter for up to that many matching tweets. The result comes back as a list and is stored in the variable twtlist. The content of twtlist is quite haphazard, so to give it a proper tabular format we convert the list into a data frame and return it.
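The list-to-data-frame step can be tried without a Twitter connection. The sketch below is only an illustration: fakeTweets is a made-up list of plain records standing in for what searchTwitter returns, and the field names in it are assumptions, not the real tweet fields.

```r
# Hypothetical stand-in for the list returned by searchTwitter():
# each element describes one tweet (all values are invented).
fakeTweets <- list(
  list(text = "Go Argentina!", screenName = "fan_a", retweetCount = 3),
  list(text = "Germany wins",  screenName = "fan_b", retweetCount = 7),
  list(text = "What a final",  screenName = "fan_c", retweetCount = 1)
)

# Turn every element into a one-row data frame, then stack the rows.
tweetDF <- do.call("rbind",
                   lapply(fakeTweets, as.data.frame, stringsAsFactors = FALSE))

print(tweetDF)
```

Each list element becomes one row, and the field names become the column names.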

Now let us look at the function popularityCheck(name1,name2,count), which is of more concern. name1 and name2 are the two search terms and count is the number of tweets we want to extract for each.

The first two lines of the function take these terms and prepare two separate data frames, one per search term.

The next two lines sort the respective data frames by the arrival times of the tweets. order() sorts in ascending order, so the earliest tweet is kept first and the latest comes last.

The next two lines prepare two vectors, eventdelays1 and eventdelays2, which hold the differences between the arrival times of consecutive tweets.
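Here is a small sketch of the sorting and differencing steps on their own. The timestamps and the toyDF data frame are invented for illustration; only the created column name is taken from the code above.

```r
# Made-up arrival times in arbitrary order (illustration only).
created <- as.POSIXct(c("2014-07-13 20:00:30", "2014-07-13 20:00:00",
                        "2014-07-13 20:00:45", "2014-07-13 20:00:10"),
                      tz = "UTC")
toyDF <- data.frame(id = 1:4, created = created)

# order() sorts ascending, so the earliest tweet ends up in row 1.
sorted <- toyDF[order(as.integer(toyDF$created)), ]

# Gaps between consecutive tweets, in seconds.
delays <- diff(as.integer(sorted$created))
print(delays)   # 10 20 15
```

Four tweets give three delays, because diff() always returns one value fewer than its input.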

Next we compute the mean of eventdelays1, named meanof1, and count the number of delays for the first search term that are no longer than this mean. meanof1 becomes the ground for comparison: we then count the delays for the second search term that are also within meanof1. The two counts are kept in sumval1 and sumval2.
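The counting step is easy to check by hand on a few numbers. In this sketch the delay values are invented, and the names delaysA, delaysB and threshold are just for illustration.

```r
# Made-up delays in seconds for two search terms (illustration only).
delaysA <- c(5, 10, 15, 30, 40)
delaysB <- c(20, 25, 35, 50, 70)

# Threshold taken from the first series: its mean delay (20 here).
threshold <- mean(delaysA)

# Count, for each series, the delays no longer than the threshold.
countA <- sum(delaysA <= threshold)   # 5, 10 and 15 qualify
countB <- sum(delaysB <= threshold)   # only 20 qualifies

print(c(threshold = threshold, countA = countA, countB = countB))
```

Since both counts use the same threshold, a larger count means tweets for that term arrive in quicker succession.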

The p1 and p2 lines compute the share of tweets that came within meanof1, that is sumval1/count and sumval2/count. l1 and u1 are the lower and upper limits of the 95% confidence interval that poisson.test reports for the rate behind p1: if the tweets really follow a Poisson process, the true rate is expected to lie in this range. The same goes for l2 and u2.
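To see what poisson.test hands back, it can be called on its own. The numbers below (333 out of 500) are invented for illustration and are not the values from the run above.

```r
# Invented numbers: 333 of 500 tweets arrived within the threshold.
withinThreshold <- 333
totalTweets <- 500

# Observed rate, used as the bar height in the plot.
rate <- withinThreshold / totalTweets

# poisson.test() treats 333 as an event count over a time base of 500
# and returns a 95% confidence interval for the underlying rate.
ci <- poisson.test(withinThreshold, totalTweets)$conf.int

print(rate)
print(ci)
```

The first element of ci is the lower limit and the second is the upper limit. If the intervals of the two search terms do not overlap, the difference between the bars is unlikely to be an accident of sampling.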

And then the godfather line is executed, which is barplot2(...). This function comes from the gplots package: please select gplots in the package window, and if it is not there you can install it by writing install.packages("gplots"). The arguments shown above are mostly self-explanatory: the bar heights, the lower and upper limits of the confidence intervals, and the bar labels. This command plots the graph, and an instance is shown above. The plot shows that people are tweeting more about Argentina and comparatively less about Germany. Well, this is it....
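If gplots is not at hand, a similar picture can be drawn with base R graphics. This is a swapped-in alternative and not what the code above uses, and all the numbers in it are invented for illustration.

```r
# Invented bar heights and interval limits (illustration only).
rates <- c(0.67, 0.52)
lower <- c(0.60, 0.46)
upper <- c(0.74, 0.59)

# barplot() returns the x positions of the bar centres.
mids <- barplot(rates, names.arg = c("#argentina", "#germany"),
                ylim = c(0, 1), ylab = "share of tweets within threshold")

# Draw the intervals as vertical segments with flat caps at both ends.
arrows(mids, lower, mids, upper, angle = 90, code = 3, length = 0.1)
```

barplot2 from gplots does the same in one call through its ci.l and ci.u arguments.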

Well, I hope it was worth your time... Please feel free to make any suggestions.

The next thing I am going to do is build a word cloud in R using tweets, and I am having fun doing it.. Till my next post, as I say, Happy Learning...
 
