Collecting the Twitter Data for a Company between two Dates

This program uses the Twitter search API to download tweets related to a set of companies. The downloaded tweets are stored in JSON file format.

import simplejson as json
import urllib.parse
import datetime
from datetime import timedelta
import oauth2

In order to access the Twitter account, several keys are needed: the consumer key, consumer secret, token key, and token secret. These keys are stored in a file named ‘keys.txt’. A sample format for keys.txt is
consumerKey=xxxx
consumerSecret=xxxx
tokenKey=xxxxx
tokenSecret=xxxxx
The user is required to enter the above details in keys.txt, and this file is kept in the folder ‘twitterAcKeys’.
The abbreviations of the companies whose tweets are to be downloaded are listed in the file ‘companyAbbvs.txt’, stored in the folder ‘companyAbbvs’. The following is an example of ‘companyAbbvs.txt’.
AAPL
MCD
IBM
MSFT
SNE
GE
WMT
FDX
After fetching the data from Twitter, the tweets are stored as a JSON file named ‘tweetsForAllCompany.json’ in the folder ‘twitterData’.
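The resulting JSON file has, roughly, the following shape: the top-level keys are the company abbreviations from ‘companyAbbvs.txt’, each mapping to a dictionary of day indices holding lists of "date | text" strings. The tweet texts below are made-up placeholders; the real contents depend on the API response.

```python
import json

# Hypothetical contents of tweetsForAllCompany.json; real tweet texts
# depend on what the API returns for each day of the requested period.
sampleOutput = {
    "AAPL": {
        "0": ["2018-11-15 | placeholder tweet about $AAPL"],
        "1": ["2018-11-14 | another placeholder tweet about $AAPL"],
    },
    "IBM": {
        "0": ["2018-11-15 | placeholder tweet about $IBM"],
    },
}
print(json.dumps(sampleOutput, indent=2))
```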
Folder and file details:

keyFldr='twitterAcKeys'
keyFn='keys.txt'
twitterDataFldr='twitterData'
twitterDataJsonFn='tweetsForAllCompany.json'
companyAbvFldr='companyAbbvs'
companyAbvFn='companyAbbvs.txt'

The user must provide a starting date and an end date; tweets are fetched for this period.
Provide the dates:

fromDateDataToFetch='2018-11-15'
tillDateDataToFetch='2018-11-09'

Read all the company abbreviations from the file.

finCmp=open(companyAbvFldr+'/'+companyAbvFn,'r')
companyAbbv=[ln.strip() for ln in finCmp.read().splitlines() if ln.strip()]  # skip blank lines
finCmp.close()

Collect authentication keys from a file

finK=open(keyFldr+'/'+keyFn,'r')
keysAll=finK.read().split('\n')
finK.close()

Create a Python dictionary keys by assigning the values read from the file ‘keys.txt’:
consumerKey=xxxx
consumerSecret=xxxx
tokenKey=xxxxx
tokenSecret=xxxxxx

keys={}
for ln in keysAll:
    if '=' not in ln:
        continue  # skip blank or malformed lines
    a,b=ln.split('=',1)
    keys[a.strip()]=b.strip()

The user specifies the period in terms of a from date and a till date. Two date objects, fromDate and tillDate, are initialized from the strings given by the user; the difference between them gives the duration in days.
An array howLong is used to store the consecutive dates from fromDate back to tillDate. For example,

howLong contains the following dates:

[‘2018-11-15’,
‘2018-11-14’,
‘2018-11-13’,
‘2018-11-12’,
‘2018-11-11’,
‘2018-11-10’,
‘2018-11-09’]
Consecutive pairs of dates are then stored in a dictionary startEndDate, and the tweets between those dates are gathered using getDataBwDates().
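The date windowing described above can be sketched on its own, independently of the download code, using the same dates as the earlier example:

```python
import datetime
from datetime import timedelta

fromDate = datetime.datetime.strptime('2018-11-15', '%Y-%m-%d')
tillDate = datetime.datetime.strptime('2018-11-09', '%Y-%m-%d')
duration = fromDate - tillDate  # 6 days

# Consecutive dates, newest first
howLong = [fromDate.strftime('%Y-%m-%d')]
for i in range(1, duration.days + 1):
    howLong.append((fromDate - timedelta(days=i)).strftime('%Y-%m-%d'))

# Adjacent dates form one-day since/until windows
windows = [{'since': howLong[i + 1], 'until': howLong[i]}
           for i in range(duration.days)]
print(windows[0])  # {'since': '2018-11-14', 'until': '2018-11-15'}
```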
Function to collect Twitter data; it takes the company keyword and the two dates bounding the period:

def obtainTwitterData(keyword,fromDateDataToFetch,tillDateDataToFetch):
    fromDate= datetime.datetime.strptime(fromDateDataToFetch,'%Y-%m-%d')
    tillDate=datetime.datetime.strptime(tillDateDataToFetch,'%Y-%m-%d')
    duration=fromDate-tillDate
    howLong = []
    howLong.append(fromDate.strftime("%Y-%m-%d"))
    for i in range(1,duration.days+1):
        dateDiff = timedelta(days=-i)
        newDate = fromDate + dateDiff
        howLong.append(newDate.strftime("%Y-%m-%d"))
    tweetsForDuration = {}
    for i in range(0,duration.days):
        startEndDate = {'since': howLong[i+1], 'until': howLong[i]}
        #collect tweets between given dates
        tweetsForDuration[i] = getDataBwDates(keyword, startEndDate)        
        print(tweetsForDuration[i])
    return tweetsForDuration

The data from Twitter is obtained through a URL. The URL consists of the address of the Twitter search API followed by the query parameters: the company abbreviation, the language, the result type, since_id, the maximum number of tweets, whether to include Twitter entities, and the start and end dates.
The base of the URL is ‘https://api.twitter.com/1.1/search/tweets.json?’. An example of a full query is
url='https://api.twitter.com/1.1/search/tweets.json?q=%24SNE&lang=en&result_type=mixed&since_id=2014&count=10&include_entities=0&since=2018-11-12&until=2018-11-17'
On submitting this query, the Twitter API authenticates the user with the provided keys, and the response and content are retrieved.
The JSON content is then loaded into a Python object and checked for errors and statuses. On success, fields such as created_at and text are collected into a list.
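The example URL above can be reproduced with urllib.parse.urlencode, which also takes care of escaping the ‘$’ prefix as %24:

```python
import urllib.parse

# Same parameters as the example query for company $SNE
query = {'q': '$SNE', 'lang': 'en', 'result_type': 'mixed',
         'since_id': 2014, 'count': 10, 'include_entities': 0,
         'since': '2018-11-12', 'until': '2018-11-17'}
url = 'https://api.twitter.com/1.1/search/tweets.json?' + urllib.parse.urlencode(query)
print(url)
```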

def getDataBwDates(keyword, startEndDate = {}):
    maximumTweets = 10
    url = 'https://api.twitter.com/1.1/search/tweets.json?'
    query = {'q': keyword, 'lang': 'en', 'result_type': 'mixed', 'since_id': 2014,'count': maximumTweets, 'include_entities': 0}
    if startEndDate:
        for key, value in startEndDate.items():
            query[key] = value
    url += urllib.parse.urlencode(query)
    response,content = twitterAuthentication(url,keys)
    jsonUrlCont = json.loads(content)
    textTweets = []
    if 'errors' in jsonUrlCont:
        print ("Error while search query API")
        print (jsonUrlCont['errors'])
    else:
        for item in jsonUrlCont['statuses']:
            d = datetime.datetime.strptime(item['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
            tweetLine = d.strftime('%Y-%m-%d')+" | "+item['text'].replace('\n', ' ')
            textTweets.append(tweetLine)
    return textTweets

The Twitter credentials are authenticated using the oauth2 framework; the client then requests the URL and returns the response and content.
Twitter authentication using the keys, followed by fetching the data:

def twitterAuthentication(url,keys,http_method="GET", post_body=None,http_headers=None):
    consumer = oauth2.Consumer(key=keys['consumerKey'], secret=keys['consumerSecret'])
    token = oauth2.Token(key=keys['tokenKey'], secret=keys['tokenSecret'])
    client = oauth2.Client(consumer, token)
    response, content = client.request(url, method=http_method, body=bytes('', "utf-8"), headers=http_headers)
    return response,content

A dictionary is created to collect all the data from Twitter. Each company abbreviation is prefixed with ‘$’ to form the search keyword, for example $AAPL.

allCompanyTweets ={}
for ci in range(len(companyAbbv)):
    companyAbbv_toSrch = '$'+companyAbbv[ci]    
    tweetsForACompany = obtainTwitterData(companyAbbv_toSrch,fromDateDataToFetch,tillDateDataToFetch)
    allCompanyTweets[companyAbbv[ci]]=tweetsForACompany
    print ("Tweets for the company "+companyAbbv_toSrch+" are fetched\n")

Print the collected tweets for each company:

for ci in range(len(companyAbbv)):
    print("Tweets for company "+companyAbbv[ci])
    for tin in range(len(allCompanyTweets[companyAbbv[ci]])):
        print(allCompanyTweets[companyAbbv[ci]][tin])

Write the tweets for all companies into a json file twitterDataJsonFn.

with open(twitterDataFldr+'/'+twitterDataJsonFn,'w') as f:
    json.dump(allCompanyTweets,f)
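The saved file can be read back with json.load. A small self-contained check, using a temporary path and placeholder data rather than the real output file:

```python
import json
import os
import tempfile

# Placeholder standing in for allCompanyTweets
data = {"AAPL": {"0": ["2018-11-15 | placeholder tweet"]}}

tmpPath = os.path.join(tempfile.gettempdir(), 'tweetsForAllCompany.json')
with open(tmpPath, 'w') as f:
    json.dump(data, f)

with open(tmpPath, 'r') as f:
    reloaded = json.load(f)

print(reloaded == data)  # True
```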
