How to collect tweets outside a given bounding box?

by Seen   Last Updated June 13, 2019 04:22 AM

I checked the Twitter API documentation.

I know how to get tweets in the United States within a bounding box such as:

[-124.47,24.0,-66.56,49.3843].

How can I easily get all the tweets NOT in this bounding box?



Answers 2


Quoting directly from the Twitter API documentation:

If you would like to exclude place matches or only include places which fall completely within the bounding box, your code will have to perform an additional filtering step after reading the filtered stream.
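
As a rough illustration of that additional filtering step, the sketch below tests whether a point lies outside a bounding box given in the same [west, south, east, north] order as the one in the question. The helper name is_outside_bbox and the sample points are hypothetical, and the check assumes the box does not cross the antimeridian.

```python
# Bounding box from the question: [west_lon, south_lat, east_lon, north_lat]
US_BBOX = (-124.47, 24.0, -66.56, 49.3843)


def is_outside_bbox(lon, lat, bbox=US_BBOX):
    """Return True if (lon, lat) falls outside bbox (hypothetical helper)."""
    west, south, east, north = bbox
    inside = west <= lon <= east and south <= lat <= north
    return not inside


# Illustrative points, not real tweet data
print(is_outside_bbox(-0.1276, 51.5072))  # a point in London
print(is_outside_bbox(-98.0, 39.0))       # a point in central Kansas
```

A point counts as outside as soon as it violates any one of the four bounds, which is why the test negates the "inside" condition instead of requiring all four bounds to fail at once.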

Below the Radar
June 28, 2013 17:03 PM

I suggest building a DataFrame from the raw tweet data file and then filtering on the coordinates, as follows.

import pandas as pd
from tweet_parser.tweet import Tweet
from tweet_parser.tweet_parser_errors import NotATweetError
import fileinput
import json

CONVERT RAW TEXT FILE TO JSON FILE

with open('tweet_stream.txt') as infile, open('tweet_stream.json', 'w') as outfile:
    for line in infile:
        if not line.strip(): continue  # skip the empty line
        outfile.write(line)  # non-empty line. Write it to output

WRITE JSON FILE INTO A PANDAS DATAFRAME

rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for line in fileinput.FileInput("tweet_stream.json"):
    try:
        tweet_dict = json.loads(line)
        tweet = Tweet(tweet_dict)
    except (json.JSONDecodeError, NotATweetError):
        continue  # skip lines that are not valid tweets

    coords = tweet.geo_coordinates  # None when the tweet has no point geotag
    rows.append({
        'DateTime': tweet.created_at_datetime,
        'user_id': tweet.user_id,
        'lat': coords['latitude'] if coords else None,
        'long': coords['longitude'] if coords else None,
        'tweet': tweet.text,
    })

df = pd.DataFrame(rows, columns=['DateTime','user_id','lat','long','tweet'])

APPLY LOCATION FILTER

latmin = 24.0
latmax = 49.3843
longmin = -124.47
longmax = -66.56

geo = df['lat'].notnull()  # keep only geotagged tweets
df_geo = df[geo].reset_index(drop=True)
# A point is outside the box if it violates ANY of the four bounds, so combine with | (or), not & (and)
loc_filter = (df_geo['lat'] < latmin) | (df_geo['lat'] > latmax) | (df_geo['long'] < longmin) | (df_geo['long'] > longmax)
df_geo = df_geo[loc_filter].reset_index(drop=True)

Note that this code filters only geotagged tweets (tweets whose coordinates field has Type: Point). It disregards tweets that are streamed because their 'place' polygon overlaps the provided bounding box, e.g. [-124.47,24.0,-66.56,49.3843].
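
For the place-based tweets mentioned above, one possible approach is to compare the tweet's place bounding box with the query box and keep only tweets whose place lies entirely outside it. The sketch below assumes the v1.1 tweet JSON layout, where place.bounding_box.coordinates holds one ring of [longitude, latitude] pairs. The helper name place_outside_bbox and the sample dictionaries are hypothetical, not part of any library.

```python
US_BBOX = (-124.47, 24.0, -66.56, 49.3843)  # [west, south, east, north]


def place_outside_bbox(tweet_dict, bbox=US_BBOX):
    """Return True if the tweet's place box does not overlap bbox at all.

    Returns None when the tweet carries no place bounding box.
    """
    place = tweet_dict.get("place") or {}
    box = place.get("bounding_box") or {}
    rings = box.get("coordinates")
    if not rings:
        return None
    lons = [pt[0] for pt in rings[0]]
    lats = [pt[1] for pt in rings[0]]
    west, south, east, north = bbox
    # Two axis-aligned rectangles are disjoint when one lies wholly to one side of the other
    return (max(lons) < west or min(lons) > east
            or max(lats) < south or min(lats) > north)


# Made-up sample tweets for illustration only
paris = {"place": {"bounding_box": {"coordinates": [
    [[2.22, 48.81], [2.47, 48.81], [2.47, 48.90], [2.22, 48.90]]]}}}
denver = {"place": {"bounding_box": {"coordinates": [
    [[-105.1, 39.6], [-104.6, 39.6], [-104.6, 39.9], [-105.1, 39.9]]]}}}

print(place_outside_bbox(paris))
print(place_outside_bbox(denver))
print(place_outside_bbox({"place": None}))
```

Requiring the place box to be fully disjoint is a conservative choice: a place that straddles the edge of the query box is treated as not outside.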

I hope this helps.

Debjit Bhowmick
June 13, 2019 03:38 AM
