CES Scorecard - Playing with DataSift and Redis

So, this has been a fun little project! If you somehow came here first, go check out my CES Scorecard. You can also find the source code on GitHub.

As I write this we are in the thick of CES 2014, and each year blogs, news organizations, and tech review sites try to publish as many articles as they can, as fast as they can. I was curious, in all the hurry to publish, which news and blogging sites were actually writing stuff people found informative enough to share. What you're looking at is a bar graph showing how often (since 6am, Jan 9) various domains have had links to them posted anywhere public on Facebook, Twitter, or Reddit. Each time a new post contains a link, I snag it (more on how in a moment), parse out the domain, and increment a counter in a Redis Cloud database. So I'm keeping score of every public post and putting the results up for you to see.

How could I accomplish such an unfathomable feat as watching EVERY Twitter post, Facebook story, and Reddit submission? I can't. That kind of firehose would take massive infrastructure and far more setup, configuration, and storage than I'm prepared to take on. Instead, I'm using a service called DataSift, which does all the heavy lifting for me. Here's a peek at the query I set up:

interaction.content contains_any "CES 2014,CES"
AND
language.tag in "en"
AND
links.normalized_url exists

This gives me a stream of all the posts which contain either "CES" or "CES 2014" in the body text, which is admittedly a bit looser than you'd likely want for real research. Then it confirms the post is in English (for simplicity's sake I left out non-English sites) and has a parseable URL. Once I've got a "stream" I can watch it live and see data start coming in right away. But while fun, it's not that useful to just watch - we want to do something! DataSift has all kinds of data destinations, including dumping directly into a database, file transfers, etc. As this was a quick project (and I am working with relatively small data sets) I opted for the HTTP POST method. DataSift sends a JSON array (either at regular intervals or in real time, limited in size) with all the actual interactions that match your query. I just had to write a simple webhook to take the JSON POST, extract the URL, and store it in Redis. In Ruby, I did it like this:

def webhook   
    # Make sure this post contains stuff in the format we expect, an "interactions"
    #   element with an array of individual interactions.
    if params[:interactions]
        params[:interactions].each do |iac|
            # Confirm there's a link, and that link has a URL.
            if iac[:links] and iac[:links][:url]
                # Use the Ruby URI parser to snag the host
                dom = URI.parse(iac[:links][:url].first).host

                # Increment my Redis sorted set by one for the host I just found.
                # ZINCRBY is a Redis command for "Find this member in the sorted set,
                #   increment its score by X, and create it if it doesn't exist yet."
                $redis.zincrby("site_counts", 1, dom)
            end
        end
    end
    respond_to do |format|
        # Respond with success per Datasift documentation
        format.json{
            render :json => Hash["success" => true].to_json
        }
    end
end
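If you want to poke at the webhook without waiting for DataSift to deliver, you can POST a hand-rolled payload at it yourself. This is just a sketch: the /webhook route and localhost:3000 are assumptions, and the payload only mimics the two fields the handler above actually reads, not DataSift's full interaction schema.

require 'net/http'
require 'json'
require 'uri'

# Fake payload shaped like what the webhook above expects: an "interactions"
# array whose entries carry a links hash with a url array.
payload = {
    interactions: [
        { links: { url: ["http://www.example.com/ces-2014-hands-on"] } }
    ]
}

# Assumed local route for the webhook action - adjust to match your routes.rb.
uri = URI.parse("http://localhost:3000/webhook")
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Post.new(uri.path, "Content-Type" => "application/json")
request.body = payload.to_json

response = http.request(request)
puts response.body   # => {"success":true}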

Pretty slick, huh? No sluggish database writes hungrily chewing up responsiveness, either - the write times for Redis's atomic operations are trivial. I'm throwing these all into a sorted set, so creating a high score list is a matter of snagging the top N elements (in this case, @leader_names gets an array of members 0 to 19 of the "site_counts" key):

# Grab the top 20 domains by score, highest first.
@leader_names = $redis.zrevrange("site_counts", 0, 19)

# Then look up each one's score.
@leader_results = Hash.new
@leader_names.each do |leader|
    @leader_results[leader] = $redis.zscore("site_counts", leader)
end
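As an aside, the redis-rb client can hand back members and scores together in one round trip via its with_scores option, which would collapse that loop into a single call. A minor optimization, sketched here assuming the same $redis connection and key:

# Same top-20 leaderboard, but ZREVRANGE ... WITHSCORES returns
# [[member, score], ...] pairs in a single round trip.
pairs = $redis.zrevrange("site_counts", 0, 19, with_scores: true)

@leader_names   = pairs.map { |member, _score| member }
@leader_results = Hash[pairs]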

On the front end, I have a simple Highcharts bar chart to show the results. You'll note the lazy direct insertion of Rails variables for the category labels and data series. One could certainly extend this with real-time updating of the data, but as this was just an excuse to play around with Redis and DataSift, I took a shortcut.

$(function () {
    $('#container').highcharts({
        chart: {
            type: 'bar'
        },
        title: {
            text: "Who's Winning CES 2014 Coverage?"
        },
        subtitle: {
            text: 'Source: DataSift.com'
        },
        xAxis: {
            categories: <%= raw @leader_names -%>,
            title: {
                text: "Website"
            }
        },
        yAxis: {
            min: 0,
            title: {
                text: 'Number of mentions (Twitter, Facebook, Reddit)',
                align: 'high'
            },
            labels: {
                overflow: 'justify'
            }
        },
        tooltip: {
            valueSuffix: ' mentions',
            pointFormat: '{point.y}'
        },
        plotOptions: {
            bar: {
                dataLabels: {
                    enabled: true
                }
            }
        },
        credits: {
            enabled: false
        },
        series: [{
            showInLegend: false,
            data: <%= @series -%>
        }]
    });
});
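One loose end: @series never actually appears in the snippets above. A plausible way to build it in the controller (and to make the interpolation slightly less lazy by emitting proper JSON) looks something like this - the .to_json calls are my own addition, not part of the original app:

# In the controller, alongside the leaderboard query above (sketch).
# Highcharts wants a plain array of numbers, in the same order as
# the category labels.
@series = @leader_names.map { |leader| @leader_results[leader].to_i }

# In the view, emitting JSON keeps quoting and escaping correct:
#   categories: <%= raw @leader_names.to_json -%>,
#   data: <%= raw @series.to_json -%>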

And there you have it! A pretty elegant way to answer “Who’s getting shared the most?” during CES. I’ll leave this query running on DataSift for a few days after CES (or until my free Redis Cloud account fills up). The Heroku app will stay up indefinitely, but it'll stop updating at some point relatively soon.

What would YOU do if you could easily sort, filter, and tabulate every Twitter, Facebook, or Reddit post? What about a bunch of other services, and you could perform sentiment and demographic analysis on them?

DataSift lets you do just that! They handle the infrastructure; you just pay based on the complexity of the queries you run.
