However, it has a few flaws.
For Twitter, the benefit of a Streaming API is probably one of scalability. Instead of us using the old REST API to ask for specific data and causing tens of thousands of data lookups, all they do is give us the end of their own data stream once it has been used in house and is now halfway across the back garden. All they have to do is allow us to filter the stream a bit to make it more relevant to our needs and put an absolute cap on the throughput (about 1% for most of us).
This looks good. 1% is enough for most development needs and streams down your connection like a low bandwidth radio station. I don't really know what the bandwidth or download is, but it's not much. Once we've developed our new and wonderful website, then we can ask, or possibly pay, Twitter to turn up the pressure a bit.
So, now let's look at the filters.
There are several ways that the stream can be filtered:
- follow - filter by userid
- track - filter by keyword
- location - filter by geographic location
- retweets - just the retweets ma'am
- links - only tweets containing a link
- random - I think they just mean unfiltered
My own first idea was inspired by the M5 motorway accident just a few miles from where I live. I was astounded that, even in this day and age, the scale of the incident was only uncovered somewhat slowly. Surely the quantity and content of the tweets from the people who were NOT in the incident itself would help to gauge its scale? So what I wanted to do was:
- Listen to what people Tweet at known traffic jam locations.
- Identify some fingerprint of common words, maybe "traffic, jam, standstill, miles" or whatever.
- Look for clusters of these words near to motorways.
- Plot the clusters based on the location of the phones that made the tweets.
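The steps above could be sketched roughly as follows. This is only a minimal illustration, not a tested design: the tweet shape (a dict with "text", "lon" and "lat" keys), the grid cell size and the minimum score are all assumptions made up for the example, and the fingerprint is just the guess from the list above.

```python
import math
import re
from collections import Counter

# Assumed fingerprint; a real one would be learned from tweets at known jams.
FINGERPRINT = {"traffic", "jam", "standstill", "miles"}


def fingerprint_score(text):
    """Count how many distinct fingerprint words appear in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return len(words & FINGERPRINT)


def cluster_counts(tweets, cell_size=0.05, min_score=2):
    """Bucket matching geotagged tweets into grid cells and count them.

    Each tweet is a hypothetical dict with "text", "lon" and "lat" keys.
    cell_size is in degrees; 0.05 is an arbitrary illustrative value.
    """
    counts = Counter()
    for tweet in tweets:
        lon, lat = tweet.get("lon"), tweet.get("lat")
        if lon is None or lat is None:
            continue  # no phone location, so nothing to plot
        if fingerprint_score(tweet["text"]) < min_score:
            continue  # does not look like a traffic jam tweet
        cell = (math.floor(lon / cell_size), math.floor(lat / cell_size))
        counts[cell] += 1
    return counts
```

Cells with unusually high counts would then be the candidate clusters to plot against the motorway map.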
I don't know how Twitter do the filtering, but evidently it's based on something fairly broad-brushed. I can live with that, maybe; all I have to do is check the Tweet's geolocation, which is added if you tweet from most modern phones. I was expecting most of the useful tweets to be from a mobile anyway, so that would work if I can just get used to maybe 2% of the tweets actually being in the bounding box. 2% of 1% is, after all, only 0.02% of all Tweets, or 1 Tweet in 5000.
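A minimal sketch of that client-side check, together with the arithmetic, might look like this. The tweet shape (a dict with "lon" and "lat" keys) and the bounding box values are hypothetical, chosen only for illustration; the 1% and 2% figures are the rough guesses from the paragraph above.

```python
def in_bounding_box(lon, lat, box):
    """Return True if the point lies inside box = (west, south, east, north)."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


# Hypothetical box around a stretch of motorway; the numbers are illustrative.
BOX = (-3.10, 51.00, -2.90, 51.20)


def keep_geotagged_in_box(tweets, box=BOX):
    """Drop tweets with no geolocation or with a point outside the box."""
    kept = []
    for tweet in tweets:
        lon, lat = tweet.get("lon"), tweet.get("lat")
        if lon is not None and lat is not None and in_bounding_box(lon, lat, box):
            kept.append(tweet)
    return kept


# The arithmetic from the text: a 1% stream, of which maybe 2% is in the box.
stream_share = 0.01
in_box_share = 0.02
usable_share = stream_share * in_box_share
print(f"{usable_share:.2%} of all tweets, about 1 in {round(1 / usable_share)}")
```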
So what happens if I assume that the word "traffic" will occur in the most useful Tweets? This is either bad science or common sense data filtering depending on how you look at it.
Alas, it appears that the Streaming API does not allow you to filter by location AND keyword! All you can do is an OR filter, so I can filter the stream to include certain areas OR certain keywords, but not certain keywords within a certain location.
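To make the difference concrete, here is a small sketch of the two behaviours. It assumes a hypothetical tweet shape (a dict with "text", "lon" and "lat" keys) and an illustrative bounding box; it models the OR behaviour described above rather than calling the real API. The AND version is simply a second filter applied on my side to whatever the stream delivers, so it cannot bring back anything the throughput cap has already dropped.

```python
KEYWORDS = {"traffic"}
# Hypothetical bounding box (west, south, east, north); values are illustrative.
BOX = (-3.10, 51.00, -2.90, 51.20)


def matches_keyword(tweet):
    """True if any keyword appears in the tweet text (case-insensitive)."""
    text = tweet["text"].lower()
    return any(keyword in text for keyword in KEYWORDS)


def matches_location(tweet):
    """True if the tweet carries a location inside BOX."""
    lon, lat = tweet.get("lon"), tweet.get("lat")
    if lon is None or lat is None:
        return False
    west, south, east, north = BOX
    return west <= lon <= east and south <= lat <= north


def stream_would_deliver(tweet):
    """What the combined filter gives me: keyword OR location."""
    return matches_keyword(tweet) or matches_location(tweet)


def wanted(tweet):
    """What I actually want: keyword AND location."""
    return matches_keyword(tweet) and matches_location(tweet)
```

A tweet about traffic in another country passes the OR test, as does a tweet about lunch sent from inside the box; only the AND test rejects both.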
To me this just renders the API all but useless, but no doubt you lot are much smarter than I and will dazzle me with your great ideas.
Please let me know.