The Repustate Promoter Score (RPS) explained

Repustate’s new metric

One of the newest and most popular features of Repustate’s Social Media Monitoring platform has been the addition of a metric we call the Repustate Promoter Score, or RPS for short. The RPS provides a simple score, from 0 to 10, that indicates the overall sentiment people have about a particular topic in question. This could be a person, a brand, a product, anything.

The RPS is calculated as follows: take the total number of positive mentions and the total number of negative mentions, compute the Wilson confidence interval using those two numbers, and then multiply by 10 to get a score between 0 and 10. That’s a mouthful, so let’s dig a bit deeper.

How it works

The total number of positive and negative mentions is easy enough to understand, but what’s this about a Wilson confidence interval? Well, it’s a math formula which tries to determine how confident you can be about a particular outcome given the data. To put this in social terms, let’s say we have 1 tweet about the Apple iPhone and that tweet is positive. How confident are we that everyone is positive about the iPhone? Well, not too confident, because we only have 1 piece of data. But let’s say we have 1000 tweets, and 950 are positive. Now our confidence level rises, and that’s the thinking behind using the Wilson confidence interval.

Some examples

Here are some examples of Repustate Promoter Scores given a number of positive and negative mentions:

Positive: 1, Negative: 0, RPS: 2
Positive: 10, Negative: 0, RPS: 7
Positive: 10, Negative: 10, RPS: 3
Positive: 0, Negative: 100, RPS: 0
Positive: 100, Negative: 5, RPS: 9
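
The post does not spell out the exact formula, so the following is only a sketch under stated assumptions: it takes the lower bound of the Wilson score interval for the share of positive mentions, uses z = 1.96 (a 95% confidence level), multiplies by 10 and rounds to the nearest whole number. The function names are made up for illustration. Under those assumptions it reproduces the five scores listed above.

```python
import math

def wilson_lower_bound(positive, negative, z=1.96):
    """Lower bound of the Wilson score interval for the positive share.

    z=1.96 (95% confidence) is an assumption, not a documented setting.
    """
    n = positive + negative
    if n == 0:
        return 0.0  # no mentions, so no evidence either way
    p = positive / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    # Clamp tiny negative values caused by floating-point error.
    return max(0.0, (centre - margin) / (1 + z * z / n))

def promoter_score(positive, negative):
    """Scale the bound to 0-10; rounding to an integer is an assumption."""
    return round(10 * wilson_lower_bound(positive, negative))

for pos, neg in [(1, 0), (10, 0), (10, 10), (0, 100), (100, 5)]:
    print(pos, neg, promoter_score(pos, neg))
```

Using the lower bound rather than the raw positive share is what keeps a single positive tweet at 2 instead of 10: with little data the interval is wide, so the bound stays low.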

When you create a new data source, Repustate calculates the RPS on an ongoing basis so you can see how the RPS fluctuates over time in your dashboard.

Using Python to data mine Twitter

With the launch of our Social API a little over a month ago, Repustate now allows its customers to create rules to monitor social media, all without ever leaving the confines of your favourite programming environment. But the piece that was missing was the ability to get various statistics and reports about your data sources. Today, that changes with Repustate’s new visualize API call.

If you’re impatient, here’s the gist of it:

If you look through the above gist, you’ll see the steps we take are pretty simple and can be parametrized and automated so you can create graphs and charts as part of your regular reporting process. Our code sample was in Python, but it can easily be ported to Ruby, PHP, Go, or any language of your choosing. To summarize the above code snippet:

  1. We created a new data source
  2. We added a monitoring rule to the source. In this case, we want all mentions of iOS7 on Twitter
  3. We then asked for a graph depicting the gender breakdown for all tweets with positive sentiment for this data source.

With very little code, we’re able to create and visualize complex queries. Neat, huh?
We’ll be adding more visualizations to the API as time passes. And by all means, if you have any suggestions, please let us know!

Data mining Twitter with Python – A little more detail

Now that we’ve skimmed the surface with that code sample up above, let’s dive a little bit deeper and see what’s going on and what the Social API allows us to do.

The first concept we introduced was the data source. A data source is any social network Repustate monitors (currently Twitter & Facebook). Now, you can create multiple instances of data sources, that is to say you can monitor Twitter for all sorts of different things at the same time.

Each data source can have one or more rules. You’ll see in the code above on lines 14-19, we define our rule. Our rule says “Get me any mention of iOS7 on Twitter, don’t include any retweets, and only tweets from people with at least 1,000 followers”. The last bit, filtering by number of followers, helps remove any spam bots.

At this point, Repustate has enough info to start fetching your data. It might take a few minutes before any data gets populated but once it does, you can run the visualize API call to create graphs and charts that summarize your data. In the above example, we chose the gender filter type, which displays the breakdown of male to female tweeters in our data source.

Here’s a sample of a graph that Repustate returns using the visualize API call:

Positive category filtered by gender

Let’s do another one. This time, let’s take a look at all devices used to create this content, sorted by those which were most commonly used:

Devices used for this data source

You can see with just a few API calls and just a few more lines of Python, we’re able to create pretty insightful visuals. No need for Excel or your own ETL processes – just Repustate’s Social API.

 

Arabic sentiment analysis – now 200x faster

A few months ago we began a task of migrating our Arabic sentiment analysis engine from a Python/Cython implementation to a Go implementation. The reason: speed. Go makes asynchronous programming and concurrency a cinch to use and that’s where we were able to realize some crazy speed boosts.

Our English language sentiment analysis engine can analyze about 500 documents / second. Our Arabic sentiment engine, in Python, did about 2 documents / second. As a result, we never allowed our customers to use our bulk sentiment API call with Arabic text because it would be too slow. Until now.

Now that our engine is in Go, we’ve opened up bulk scoring in Arabic for everyone. You can now score 100 Arabic documents per API call. We’ll gradually increase the limit as time goes on – we’re just testing the waters for now.

We believe our solution was already the most accurate Arabic sentiment engine available – now we’re quite sure we have the fastest one out there, too.

So go give it a shot and let us know what you think.

What’s the secret to great customer service? Don’t be a jerk.

One of the most frustrating aspects of dealing with larger companies is the customer service experience. Whether it’s being caught in a game of customer service hot potato (“Please wait while we transfer you for the 8th time”) or being told “I’m sorry we can’t do that, company policy” it can permanently ruin a customer’s relationship with a company. While larger companies perhaps feel strict policies and procedures provide a consistent experience and therefore make their business easier to administer in the long run, Repustate has never adopted a one-size-fits-all approach to customer service. Each customer is different, so let’s treat them that way.

Examples of good customer service

The one rule we do hold constant is this: Don’t be a jerk. It’s kind of our version of Google’s (in)famous “Don’t be Evil” motto. Here are a few examples of “Don’t be a jerk” in practice:

1) I wouldn’t have believed it until I saw it for myself, but every so often, we get customers who sign up for Repustate’s priced plans, use the plan for the month, then contact us and say they’d like a refund for that month. Crazy, I know, and we would be entirely correct and fair if we said “Are you out of your mind? That’s like getting a bucket of chicken from KFC and returning a bucket full of bones and asking for your money back”. But you know what – we always refund their money. It’s just not worth our time to argue and go back and forth. Also, we’re hoping that our leniency and understanding will result in these customers coming back one day. In our experience, about 50% of the time they do come back at a later date as permanent, paying customers.

2) We often get students contacting us wanting to use Repustate for their research papers. The free plan we offer only provides 1000 API calls per month and often these students need hundreds of thousands of calls to complete their project. We always give students a free plan with unlimited usage for a few weeks on condition that they A) cite Repustate in the paper/project and B) send us the final paper or a link to the site. We like to think it keeps Repustate in good standing with the karma gods, but more seriously, these students will graduate one day, go work somewhere and might recommend Repustate to their employers.

3) Prompt email responses. Sounds like a no-brainer, but Repustate does not send out “Thank you for contacting Repustate” emails. I hate getting those because it gets your hopes up that someone replied, and then you check the contents, and you get sad :(  We just reply promptly, always same day, usually within the hour.

4) Honour your existing customers. We recently underwent some price plan changes and modified the account quotas. Now, we could have gone to all of our existing customers and said “Too bad, new quotas in place” but we didn’t. We continue to honour some of our older plans because that’s the right thing to do.

To sum up, just be nice, it’s not that hard. It’ll pay off in the long run, if not the short run as well.

What Python developers need to know before migrating to Go(lang)

This is a (long) blog post about our experience at Repustate in migrating a big chunk of code from Python/Cython to Go. If you want to read the whole story, background and all, read on. If you’re interested in just what Python developers need to know before taking the plunge, click the link below.

Tips & tricks in migrating from Python to Go.

The Background

One of the best technological feats that we’ve done here at Repustate was implementing Arabic sentiment analysis. Arabic is one tough nut to crack because of the complex morphological forms Arabic words can take. Tokenization (splitting a sentence up into individual words) is also tougher in Arabic than in, say, English, because Arabic words can contain whitespace within the word itself (e.g. the position of ‘aleph’ within a word). Without giving away our secret recipe, Repustate uses support vector machines (SVM) to come up with the most likely meaning behind a sentence and then applies sentiment to that. In total, we use 22 models (i.e. 22 SVMs) and each word in a document is analyzed. So if you have 500 words in a document, that’s 500 × 22 = 11,000 comparisons against the SVMs.

Python

Repustate is almost entirely a Python shop; we use Django for the API and website. So it only made sense (at the time) to keep the code base homogenous and implement all of the Arabic sentiment engine in Python as well. As far as prototyping and implementing goes, Python is hard to beat. Very expressive, awesome 3rd party libraries etc. If you’re serving up web pages, it’s perfect. But when you’re doing low level computations, doing lots of comparisons against hash tables (dictionaries in Python), things get slow. We were able to process about 2-3 Arabic documents per second, which is too slow. By comparison, our English language sentiment engine can do about 500 per second.

The Bottleneck

So we fired up the Python profiler and began investigating what was taking so long. Remember above how I said we have 22 SVMs and each word passes through them? Well, that was all done in serial, not in parallel. OK, our first thought was to change this to a map/reduce-like operation. TL;DR: The map/reduce idiom stinks in Python. When you need concurrency, Python is just not your friend. At PyCon 2013, Guido spoke about Tulip, his new project that was hoping to remedy this, but that’s not due out for a while, and why wait when there’s already something better out there?

Golang or go home

My friend at Mozilla told me that Mozilla Services was switching over to Go for much of their logging infrastructure, in part because of the awesomeness of goroutines. Go was designed by the folks at Google and it was designed with concurrency as a first-class notion, not an afterthought, as Python’s various solutions are. So we went about making the change from Python to Go.

While the Go code is not yet in production, the results are ridiculously encouraging. We’re doing 1000 documents/s now, using WAY less memory, and not having to debug ugly multiprocess/gevent/”why won’t Control-C kill my process” code that you get in Python.

Why we love Go

Anyone who has a bit of an understanding of how programming languages work (interpreted vs. compiled, dynamic vs. static) will say, “Well duh, obviously Go is faster”. Yeah, we could have re-written the whole thing in Java and seen similar improvements, but that’s not why Go is such a winner. The code you write with Go just seems to be correct. I can’t really put my finger on it, but somehow once the code compiled (and it compiles QUICKLY), you just get the feeling that it’ll work (not just run without error, but even logically be correct). I know, that sounds very wishy-washy, but it’s true. It’s very similar to Python in terms of verbosity (or lack thereof) and it treats functions as first-class objects, so functional programming is easy to reason about. And of course, goroutines and channels make your life so much easier. So you get the performance boost of static typing and having finer control over memory allocation but you don’t forfeit too much in expressiveness.

Things we wish we knew

With all the compliments out of the way, you really do need a different mindset at times when dealing with Go compared to Python. So here’s a list of notes I kept as the migration took place – just random things that popped into my head when converting Python code to Go:

  • No built-in type for sets (have to use maps and test for existence)
  • In absence of sets, have to write your own intersection, union etc. methods
  • No tuples, have to write your own structs or use slices (arrays)
  • No __getattr__() like method, so you have to always check for existence rather than setting defaults e.g. in Python you can do value = dict.get("a_key", "default_value")
  • Having to always check errors (or at least explicitly ignore them)
  • Can’t have variables/packages that aren’t used so to test simple things requires sometimes commenting out lines
  • Going between []byte and string. regexp uses []byte (they’re mutable). It makes sense, but it’s annoying all the same having to cast & re-cast some variables.
  • Python is more forgiving. You can take slices of strings using indexes that are out of range and it won’t complain. You can take negative slices – not Go.
  • You can’t have mixed type data structures. Maybe it’s not kosher, but sometimes in Python I’ll have a dictionary where the values are a mix of strings and lists. Not in Go, you have to either clean up your data structures or define custom structs
  • No unpacking of a tuple or list into separate variables (e.g. x, y, z = [1, 2, 3])
  • UpperCamelCase is the convention (if you don’t have a title case on the function/struct in a package it won’t be exposed to other packages). I like Python’s lower_case_with_underscores more.
  • Have to explicitly check if errors are != nil, unlike in Python where many types can be used for bool-like checks (0, “”, None can all be interpreted as being “not” set)
  • Documentation on some modules (e.g. crypto/md5) is sparse BUT go-nuts on IRC is awesome, really great support available
  • Type casting from number to string (int64 -> string) is different than going from []byte -> string (just use string([]byte)). Need to use strconv.
  • Reading Go code is definitely more like a programming language whereas Python can be written as almost pseudocode. Go has more non-alphanumeric characters and uses || and && instead of “or” and “and”.
  • Writing to a file, there’s File.Write([]byte) and File.WriteString(string) – a bit of a departure for Python developers who are used to the Python zen of having one way to do something
  • String interpolation is awkward, have to resort to fmt.Sprintf a lot
  • No constructors, so common idiom is to create NewType() functions that return the struct you want
  • Else (or else if) has to be formatted properly, where the else is on the same line as the curly bracket from the if clause. Weird.
  • A different assignment operator is used depending on whether you are inside or outside of a function, i.e. = vs :=
  • If I want a list of just the keys or just the value, as in dict.keys() or dict.values(), or a list of tuples like in dict.items(), there is no equivalent in Go, you have to iterate over maps yourself and build up your list
  • I use an idiom at times of having a dictionary where the values are functions that I want to invoke given a key. You can do this in Go, but all functions have to accept & return the same thing i.e. have the same method signature
  • If you’re using JSON and your JSON is a mix of types, goooooood luck. You’ll have to create a custom struct that matches the format of your JSON blob, and then Unmarshal the raw JSON into an instance of your custom struct. Much more work than just obj = json.loads(json_blob) like we’re used to in Python land.
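
For reference, here is what several of the Python habits mentioned in the list look like side by side. This is only an illustrative sketch; the variable names and sample data are made up. Each line is something that needs an explicit loop, an existence check, or a custom struct once it is ported to Go.

```python
import json

config = {"a_key": 1}
# dict.get with a default: no existence check needed
value = config.get("missing_key", "default_value")

# Built-in sets with intersection and union
common = {1, 2, 3} & {2, 3, 4}
either = {1, 2} | {2, 3}

# Unpacking a list into separate variables
x, y, z = [1, 2, 3]

# Forgiving slices: negative and out-of-range indexes
word = "golang"
tail = word[-2:]
clipped = word[2:100]

# Mixed-type values in one dictionary
mixed = {"name": "iOS7", "tags": ["apple", "mobile"]}

# Keys without writing a loop
keys = sorted(mixed.keys())

# Truthiness: 0, "" and None all count as "not set"
unset = [v for v in (0, "", None) if not v]

# JSON with mixed types loads straight into dicts and lists
obj = json.loads('{"id": 7, "labels": ["a", "b"], "meta": {"ok": true}}')

# A dictionary of functions, dispatched by key
handlers = {"upper": str.upper, "lower": str.lower}
shout = handlers["upper"]("go")
```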

Was it worth it?

Yes, a million times, yes. The speed boost is just too good to pass up. Also, and this counts for something I think, Go is a trendy language right now, so when it comes to recruiting, I think having Go as a critical part of Repustate’s tech stack will help.

Profiling Go programs with pprof

There’s an old blog post floating around on the Go Blog with instructions on how to profile your code using pprof – except it’s out of date and  might confuse some people. Here’s an up-to-date (as of Go 1.1) way to profile your code.

1. import "runtime/pprof" (and "os", which is used below to create the profile file)
2. Where you want the CPU profiling to start, you write:

pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()

This has the effect of starting the profiler and then ending it once the containing function (often your “main” function) exits. The argument you pass to StartCPUProfile, “f”, is an open file to which the profile information will be written. Create this file before the call to StartCPUProfile, and check err before using f, by writing:

f, err := os.Create("my_profile.file")

3. OK, we have our profiling code set up. Now, build your binary. If my Go file is called “foo.go”, I would run “go build foo.go”. This will create a binary called “foo”.
4. Run your binary (e.g. ./foo)
5. Now you run “go tool pprof foo my_profile.file” and you will be in the pprof interactive prompt. Here you can issue commands like top10 or top50 etc. Read up on pprof to find out all the commands you can enter.

Social media data categorized automatically

Social media data is being generated faster than we can count, but it doesn’t have to be hard to organize and analyze the mountain of data. Repustate’s newest API call categorizes your social media data making it easier for you to analyze and report on the data that is important to your company.

Categorize by industry

Repustate’s categorization works within the context of a particular industry. Currently, the supported industries are airlines, hotels, telecommunications, and restaurants. The possible categories for each industry are different because of course, the industries themselves are quite different. To view the full list of categories available for each, take a look at our API docs.

Social media example (and when sentiment alone is not enough)

Let’s say we are a hotel chain, Repustate Hotels, and we have a Twitter account used for getting feedback from our customers. When a customer tweets “@RepustateHotels I loved the beds in your hotel, but the food was terrible”, there are a few pieces of sentiment being stated here. Is the overall sentiment of this sentence positive or negative? I would argue it’s both. The customer liked the beds, but did not like the food. It would be useful to extract these bits of information separately. That’s what Repustate’s categorize API call does automatically.

As a result of passing the above tweet to the categorize API call, you’ll get the following structure back (in JSON):

 {"food": [{"chunk": "the food was terrible", "score": -0.42798}], "accommodations": [{"chunk": "@ RepustateHotels_I loved the beds in your hotel", "score": 0.23237}]}

Now very quickly you can see the various categories within the hotel industry that this text matches up with. Expanding on this idea, you can envision how this would be an invaluable tool to assess customer satisfaction across your entire organization and see which areas need to be improved.
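
As a rough sketch of how such a response might be consumed, the snippet below parses the JSON shown above and groups the chunks by the sign of their score. Treating a negative score as negative sentiment is an assumption drawn from this one example, and the function name is made up.

```python
import json

# The response shown above, copied verbatim.
response = (
    '{"food": [{"chunk": "the food was terrible", "score": -0.42798}], '
    '"accommodations": [{"chunk": "@ RepustateHotels_I loved the beds in '
    'your hotel", "score": 0.23237}]}'
)

def split_by_sentiment(raw_json):
    """Group (category, chunk, score) tuples by the sign of the score."""
    positive, negative = [], []
    for category, chunks in json.loads(raw_json).items():
        for chunk in chunks:
            bucket = positive if chunk["score"] >= 0 else negative
            bucket.append((category, chunk["chunk"], chunk["score"]))
    return positive, negative

positive, negative = split_by_sentiment(response)
print("Needs attention:", [category for category, _, _ in negative])
print("Going well:", [category for category, _, _ in positive])
```

Run over a whole feed of tweets, the same grouping would give a per-category tally of complaints and compliments.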

How to extract images from a web page

Ever wanted to extract images from web pages? Now you can with one simple API call.

 

Repustate’s clean-html API call has been one of our most popular API calls since Day 1. It hasn’t been touched much as its performance was quite good from the get-go, but that changed recently. Now you can extract images as well as the text from any web page.

We had a customer request to add the ability to extract the main image from a web page as well, similar to how Instapaper or Mobile Safari’s “Reader” feature works.

Now by default, when you call clean-html, an image attribute comes back with a URL for the main image, if it exists, for a given article.

Let’s take a look at an example. You’ll need a Repustate API key to try this on your own but it’s free and easy to get one. Let’s take this URL:

http://www.thestar.com/news/insight/2013/02/15/challenging_the_vatican_progressive_catholics_say_reform_must_begin_with_church_governance.html

and pass it to our API call.

curl -d "url=http://www.thestar.com/news/insight/2013/02/15/challenging_the_vatican_progressive_catholics_say_reform_must_begin_with_church_governance.html" http://api.repustate.com/v2/YOUR_API_KEY/clean-html.json

And here’s the response:

 {"status": "OK", "text": "To progressive Canadian Catholic ... (shortened for this example)", "image": "http://www.thestar.com/content/dam/thestar/news/insight/2013/02/15/challenging_the_vatican_progressive_catholics_say_reform_must_begin_with_church_governance/vatican_lightning.jpg.size.xxlarge.promo.jpg", "url": "http://www.thestar.com/news/insight/2013/02/15/challenging_the_vatican_progressive_catholics_say_reform_must_begin_with_church_governance.html"}

As you can (kind of) see, there is an ‘image’ key in the JSON response with a URL for the main image of that article.
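
The same call can be made from Python. The sketch below mirrors the curl command above (a form-encoded POST with a url field) and then picks the image key out of the response. The function names, the timeout, and the choice to treat any status other than "OK" as a failure are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as it appears in the curl command above.
API_URL = "http://api.repustate.com/v2/{api_key}/clean-html.json"

def main_image_from_response(body):
    """Return the 'image' URL from a clean-html JSON response, or None."""
    data = json.loads(body)
    if data.get("status") != "OK":  # assumption: anything else is a failure
        return None
    return data.get("image") or None

def fetch_main_image(api_key, page_url, timeout=10):
    """POST the page URL as form data and return its main image, if any."""
    payload = urlencode({"url": page_url}).encode("utf-8")
    endpoint = API_URL.format(api_key=api_key)
    with urlopen(endpoint, data=payload, timeout=timeout) as resp:
        return main_image_from_response(resp.read().decode("utf-8"))
```

Keeping the parsing in its own function means it can be checked against a saved response without making a network call.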

With this API call, you can create your own version of Instapaper or Readability for your own purposes.

How to manage large data structures in Python

Our sentiment analysis relies on various models to do feature extraction for the purposes of determining sentiment. These models are stored in the form of Python dictionaries and they’re big – about 40MB each, and there are 14 models in total. Doing the math, that’s 560MB of data that must be available to a process. Now, not all of the data is used for each request; only some of the keys in these dictionaries are used for a particular piece of text that needs to be analyzed.

With all this in mind, we recently hit a performance snag with our sentiment engine. The naive approach was to load these models at server startup time by making a reference to the models in the settings.py file of our Django app. (Repustate’s API is Django under the hood). The downside was that it took a while, about 40s, to start up the API server each time we made a change to the code. The upside was that our code was easy to reason about because we’re just using Python dictionaries and they’re simple to use.

But our memory usage, as you can imagine, skyrocketed. We use Apache’s mpm-worker module to run our API servers and each process gets a copy of those models. Our machines have a good amount of RAM, but not enough, and we started to see lots of thrashing and swapping going on. Performance degraded as time went on. We needed a better solution.

Enter the ModelProxy

 We came up with the notion of a ModelProxy. This Python class would mimic a normal Python dictionary as much as possible in order to minimize changes to the code (e.g. my_model_proxy['some_key'] would still work) but under the hood, the special “magic” methods would access a separate data store.

We took our models and flattened them out, storing the flat key/value pairs in MongoDB. What does a flattened dictionary look like? Imagine you have a dictionary with the following structure:

{'a':'b', 'c':{'d':'e'}}

A flattened version would look like this:

{'a':'b', 'c.d':'e'}

As you can see, we combined the keys ‘c’ and ‘d’ to make a new key ‘c.d’. Applying this recursively converts an otherwise several level deep dictionary into a flat, simple key/value pair dictionary of only one level. How you combine the individual keys to form the new “composite” key is up to you and might be influenced by your underlying data store. MongoDB doesn’t allow ‘.’ in key names, so we used ‘//’ as our separator. Here’s the code to flatten a Python dictionary. You’ll notice in the code that we also hash the keys – again that’s to remove any ‘.’ in the key names, which our keys do contain.
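
What follows is not Repustate’s code, just a minimal sketch of the flattening idea described above. It assumes string keys, uses ‘//’ as the separator, and offers optional hashing of each key part; MD5 is an arbitrary choice here, since the post does not say which hash was used.

```python
import hashlib

SEPARATOR = "//"

def hash_key(key):
    """Hash one key part so characters such as '.' never reach the store."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()

def flatten(nested, prefix="", hash_keys=False):
    """Recursively collapse a nested dict into one level of composite keys."""
    flat = {}
    for key, value in nested.items():
        part = hash_key(key) if hash_keys else key
        composite = prefix + SEPARATOR + part if prefix else part
        if isinstance(value, dict):
            # Descend, carrying the composite key built so far.
            flat.update(flatten(value, composite, hash_keys))
        else:
            flat[composite] = value
    return flat

print(flatten({"a": "b", "c": {"d": "e"}}))
# {'a': 'b', 'c//d': 'e'}
```

With hash_keys=True each part of the composite key becomes a hex digest, so a lookup has to hash each part of the key the same way before querying.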

So now we have a flattened dictionary. We push these key/value pairs into a MongoDB collection and now we have a really fast, process-independent data store. The last bit is the aforementioned ModelProxy. The proxy provides a similar interface to a Python dictionary, but whenever __getitem__ is called via the “index” notation, i.e. some_dict[some_key], we convert that into a query on our MongoDB collection. If we use the dictionary from the first example above, this is how you’d use the model proxy:

>>> proxy['a'].value()
'b'
>>> proxy['c']
<ModelProxy instance>
>>> proxy['c']['d'].value()
'e'

You’ll notice that when we called __getitem__ on a key whose value is another dictionary, the return value was a new ModelProxy instance. This is how we’re able to traverse an otherwise multilevel dictionary without having to change any notation or Python code. Here’s the code for the ModelProxy. You’ll also see that we have some helper functions in there to facilitate getting the values for multiple keys all in one query.

The one downside is we had to introduce the value method to actually retrieve the value from MongoDB, but other than that, we were able to mimic a Python dictionary perfectly (or as much as we needed to).
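
To make the idea concrete, here is a minimal sketch of such a proxy. It is not Repustate’s implementation: a plain dictionary stands in for the MongoDB collection so the example runs on its own, and the values helper is a hypothetical stand-in for the multi-key helpers mentioned above.

```python
class ModelProxy:
    """Dictionary-like view over a flat key/value store."""

    def __init__(self, store, prefix=None, separator="//"):
        self._store = store  # stand-in for the MongoDB collection
        self._prefix = prefix
        self._separator = separator

    def __getitem__(self, key):
        # Indexing never touches the store; it only extends the composite key.
        if self._prefix is None:
            composite = key
        else:
            composite = self._prefix + self._separator + key
        return ModelProxy(self._store, composite, self._separator)

    def value(self):
        # The lookup happens here; a real backend would issue a query instead.
        return self._store[self._prefix]

    def values(self, keys):
        # Convenience helper for several child keys; a real backend could
        # batch these lookups into a single query.
        return {key: self[key].value() for key in keys}


store = {"a": "b", "c//d": "e"}  # the flattened example dictionary
proxy = ModelProxy(store)
print(proxy["a"].value())       # b
print(proxy["c"]["d"].value())  # e
```

Because __getitem__ always returns another proxy, code written against nested dictionaries keeps its indexing syntax, and only the final read needs the extra value() call.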

A very simple example of using “expect” on unix

Often times you have to interact with programs that require passwords or some other input from the user. For security purposes, some programs will not read from stdin so you have to be creative. Enter “expect”. Expect is a program written in Tcl that allows you to mimic a conversation you’d have with any number of programs. There are lots of examples on the web, but I wanted to put up a really simple one just to get the picture across of how it can be used.

Let’s consider this simple python program, foo.py:

input = raw_input("Enter:")
print "You entered %s" % input

Running python foo.py will print “Enter:” to the command line and wait for the user to enter something followed by hitting “Enter”. It will then print out what the user entered, a simple echo program. Let’s drive this programmatically using expect (aside: yes, I know you don’t need expect to do something like this).

Here’s our expect script:

#!/usr/bin/expect
spawn python foo.py
expect "Enter:"
send "Repustate\r"
expect eof

Line by line:

  1.  Indicates which interpreter should be used to execute the code that follows
  2.  spawn starts a new program. So we’re running our python program as we normally would
  3. OK here’s the magic; we’re saying that we expect the string “Enter:” to be returned by the spawned program.
  4. Now we’re saying send the string “Repustate\r” (watch that carriage return, we need it otherwise python will just wait) to the waiting python process.
  5. This line means to wait for the program we spawned to finish before we exit the expect script.

There you have it, a really simple intro to how to use expect to expect what you expect. There is so much you can do with expect and it’s left to the reader to go out and discover all these nifty features. It really comes in handy when you want to interact with programs that require passwords but, for security purposes, won’t read from stdin.