Loving Python

Python is extremely easy to learn and, besides a few warts, a joy to program in.

I really like the 'pythonic' approach of focusing on readability over cleverness, i.e. explicit is better than implicit. This makes it really easy to open up a Python library and quickly understand what is going on. I can give up that little bit of expressiveness and flexibility for highly readable code. I also really like the module system in Python.

But enough about the language, because what is really drawing me into this world of Python more than anything else is the ecosystem. Its programmers are a varied bunch, from sysadmins to scientists, not just web developers, and it shows in the libraries. Matplotlib, NumPy, IPython Notebook and scikit-learn are just INCREDIBLE!

In order to get better acquainted with some of these tools I decided to play with the reddit API. But really I just want to show how little code it takes to get your thoughts onto paper, to quickly visualise and explore an idea.

Let's grab some data

Wil Wheaton (Wesley Crusher on Star Trek TNG) is an active redditor. Let's grab his last 1000 comments and play around with that data.

We have to respect reddit's 2-second rule, so this takes ~20 seconds (10 pages of 100 comments, one request every 2 seconds).

In [1]:
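import requests, re, time

def get_all(max_depth=5,after=None):
    """recursive function to keep fetching pages (reddit limits to batches of 100)"""

    url = ""  # comments listing URL; the "&after=" below assumes it already carries a query string
    if after:
        url += "&after="+after

    print max_depth, url
    time.sleep(2)  # respect reddit's 2-second rule between requests
    page = requests.get(url).json()["data"]  # .json() is a method in requests >= 1.0

    # keep following the "after" cursor until it runs out or max_depth is used up
    if page["after"] and max_depth > 0:
        return page["children"] + get_all(max_depth-1,page["after"])
    return page["children"]

data = get_all(15) # go get a quick coffee now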
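
The same cursor-following idea can also be written as a plain loop instead of recursion. This is only a sketch under stated assumptions: `fetch_all`, `fetch_page` and `FAKE_PAGES` are hypothetical names that do not appear in the notebook, and the fake pages stand in for the network call so the snippet runs offline on Python 2 or 3.

```python
def fetch_all(fetch_page, max_pages=10):
    """Follow the "after" cursor page by page, without recursion.

    fetch_page is a hypothetical callable: it takes a cursor (or None) and
    returns a dict shaped like the "data" part of a reddit listing.
    """
    children = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        children.extend(page["children"])
        after = page["after"]
        if not after:  # no cursor means this was the last page
            break
    return children


# Stand-in for the network call: three fake pages keyed by cursor.
FAKE_PAGES = {
    None: {"children": ["c1", "c2"], "after": "t1_b"},
    "t1_b": {"children": ["c3", "c4"], "after": "t1_c"},
    "t1_c": {"children": ["c5"], "after": None},
}

comments = fetch_all(FAKE_PAGES.get)
```

With a real `fetch_page`, the sleep and the HTTP request would live inside that callable, which keeps the paging logic easy to try out without hitting reddit.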

Let's inspect how many comments we got back and pull out a sample comment to inspect the JSON.

In [3]:
import pprint
print len(data), "comments"
pprint.pprint(data[0])  # any comment will do as a sample
{u'data': {u'approved_by': None,
           u'author': u'wil',
           u'author_flair_css_class': None,
           u'author_flair_text': None,
           u'banned_by': None,
           u'body': u"I'm not sure how that happened, and just noticed it myself a few days ago.\n\nI'm not saying the most likely answer is aliens, but... Aliens.",
           u'body_html': u'<div class="md"><p>I'm not sure how that happened, and just noticed it myself a few days ago.</p>\n\n<p>I'm not saying the most likely answer is aliens, but... Aliens.</p>\n</div>',
           u'created': 1353202013.0,
           u'created_utc': 1353173213.0,
           u'downs': 0,
           u'edited': False,
           u'gilded': 0,
           u'id': u'c72ta9y',
           u'likes': None,
           u'link_id': u't3_13absb',
           u'link_title': u'Wil Wheaton comments on Google plus that he likes those You Choose stories and gets this tweeted to him later...',
           u'name': u't1_c72ta9y',
           u'num_reports': None,
           u'parent_id': u't1_c72nw8k',
           u'replies': None,
           u'subreddit': u'geek',
           u'subreddit_id': u't5_2qh17',
           u'ups': 4},
 u'kind': u't1'}

OK, that gives us something to play with. Let's have a look at Wil's activity per subreddit. In other words, let's count his comments per subreddit.

In [58]:
import collections
from pylab import arange, barh  # already in scope when the notebook runs in pylab mode
subreddits = collections.Counter(map(lambda x:x["data"]["subreddit"], data))
sdata = subreddits.items()
sdata = sorted(sdata, key=lambda x: x[1], reverse=True)

pos = arange(0,len(sdata))
barh(pos, zip(*sdata)[1])

Informative, but maybe a little too much information. Let's filter out the subreddits where he has posted five times or fewer.

In [62]:
sfdata = [d for d in sdata if d[1] > 5]
pos = arange(0,len(sfdata))
barh(pos, zip(*sfdata)[1])
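
The counting, sorting and filtering steps above can be squeezed a little further, because `collections.Counter` has a `most_common()` method that returns the pairs already sorted by count. A minimal sketch on made-up subreddit names (the `names` list is invented for illustration, not taken from Wil's data):

```python
import collections

# Made-up subreddit names, one entry per comment.
names = ["scifi", "geek", "scifi", "beer", "scifi", "geek"]
counts = collections.Counter(names)

# most_common() returns (name, count) pairs, highest count first
ranked = counts.most_common()

# keep only the subreddits above a threshold (here: more than one comment)
frequent = [(name, n) for name, n in ranked if n > 1]
```

Here `ranked` plays the role of `sdata` and `frequent` the role of `sfdata`.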

Popularity Contest (err... karma)

So now we have a basic overview of what Wil gets up to. I have a hunch that even though he has posted less in the Star Trek and sci-fi subreddits, those two will still garner him the most upvotes. After all, he might not always be recognized in the other threads.

If we keep the subreddits in the same order as above and plot them against relative upvotes, we can quickly verify this theory.

So for a given subreddit, relative upvotes just means:

total upvotes / total comments

or average votes per comment, if you will.
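
To make the formula concrete, here is a tiny worked example. The numbers are made up for illustration and are not Wil's actual scores; the snippet runs on Python 2 or 3.

```python
# Made-up (subreddit, upvotes) pairs, one per comment.
comments = [("scifi", 120), ("scifi", 80), ("beer", 3), ("beer", 5), ("beer", 4)]

totals = {}
counts = {}
for subreddit, ups in comments:
    totals[subreddit] = totals.get(subreddit, 0) + ups
    counts[subreddit] = counts.get(subreddit, 0) + 1

# relative upvotes = total upvotes / total comments
relative = {}
for name in totals:
    # float() keeps the division exact on Python 2 as well
    relative[name] = float(totals[name]) / counts[name]
```

Two sci-fi comments with 200 upvotes between them give 100 per comment, while three beer comments with 12 upvotes give 4 per comment.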

In [63]:
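def upvotes(subreddit):
    # total upvotes in the subreddit divided by the number of comments there;
    # float() avoids Python 2 integer division, which would truncate the result
    ups = sum(x["data"]["ups"] for x in data if x["data"]["subreddit"] == subreddit)
    return float(ups) / subreddits[subreddit]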

sudata = [ (d[0],upvotes(d[0])) for d in sdata if d[1] > 5]
pos = arange(0,len(sudata))
barh(pos, zip(*sudata)[1])

Suspicion confirmed, although I had forgotten about Pics and IAmAs. Yet it makes perfect sense that these two would be high as well. The pattern seems to be that Wil scores a ton of karma on his celebrity status, while the homebrewers, scotch drinkers, hockey players and board gamers have no idea who he is.

I won't guess what is going on in /r/OperationGrabAss and will leave it to some other brave soul to explore that one.

Moving on, rambling for greater good

I wonder if we can detect some pattern between post length and upvotes garnered. Do longer posts mean higher scores, or are one-liners the true path to karma?

Let's have a look

In [29]:
import datetime, re

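# keep the cache across re-runs of this cell so each subreddit is only fetched once
if 'sr_cache' not in locals():
    sr_cache = {}

def subreddit_subscribers(subreddit):
    if subreddit not in sr_cache:
        r = requests.get("" % locals())  # URL template filled in from local variables
        sr_cache[subreddit] = r.json()["data"]["subscribers"]  # .json() is a method in requests >= 1.0

    return sr_cache[subreddit]

def makeX(r):
    # word count: split on runs of non-word characters and drop the empty strings
    post_length = len(filter(lambda x: len(x) > 0, re.split(r"\W+",r["data"]["body"])))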
    upvotes = r["data"]["ups"]
    downvotes = r["data"]["downs"]
    sr_score = subreddit_subscribers(r["data"]["subreddit"])
    return [upvotes, downvotes, post_length, sr_score]

X = [makeX(r) for r in data]
X = array(X)
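
The word count inside `makeX` can be checked on its own. A minimal sketch, assuming that "word" means a run of letters, digits or underscores; `word_count` is a hypothetical helper, not part of the notebook, and the sample text is the comment body shown earlier.

```python
import re

def word_count(text):
    """Count runs of word characters (letters, digits, underscore)."""
    return len(re.findall(r"\w+", text))

sample = "I'm not saying the most likely answer is aliens, but... Aliens."
n_words = word_count(sample)  # 12: the apostrophe splits "I'm" into "I" and "m"
```

Counting matches with `findall` gives the same number as splitting on non-word characters and discarding the empty strings, so contractions are counted as two words either way.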
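
In [64]:
def scatt(ax,X1,X2,lbl,title):
    ax.scatter(X1,X2,c="red", s=40)
    ax.set_xlabel(lbl)  # label the x axis with the plotted feature
    ax.set_title(title)

fig, axes = subplots(1, 2)  # one panel per scatter plot; subplots comes from pylab

scatt(axes[0],X[:,3],X[:,0].clip(0,500),"subreddit size","Outliers clipped for more zoomed in view")
scatt(axes[1],X[:,2],X[:,0].clip(0,500),"post length in words","Outliers clipped for more zoomed in view")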

Unfortunately I'm not seeing any patterns here. I would have liked to demonstrate scikit-learn for some basic machine learning, but that will have to wait for another time, I guess.
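
One way to back up the eyeball test with a number is Pearson's correlation coefficient, which is close to 0 when two columns have no linear relationship and close to 1 or -1 when they move together. This is a sketch only and was not part of the original analysis: `pearson` is a hypothetical helper, and the numbers below are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up numbers: post length in words and upvotes for five comments.
lengths = [5, 12, 30, 44, 80]
ups = [10, 24, 60, 88, 160]  # exactly twice the length, a perfect linear pattern
r = pearson(lengths, ups)
```

On the real data, `numpy.corrcoef(X[:,2], X[:,0])[0, 1]` would give the same statistic for post length against upvotes.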

Also worth noting, this whole post was written in ipython notebook. The sexiest REPL you'll ever see.
