Segmenting Twitter hashtags

Segmenting Twitter hashtags to gain insight

Twitter hashtags are everywhere these days, and the ability to data mine them is an important one. The problem with hashtags is that each one is a single long string composed of a few smaller words. If we can segment the long hashtag into its individual words, we can gain some extra insight into the context of the tweet and maybe determine the sentiment as a result.

So what to do? How do we solve the problem of the long, single string?

Use the probabilities, Luke

As with our Chinese sentiment analysis, we rely on conditional probabilities to determine the most likely words in a string of characters. Put differently, you’re trying to answer the question: “If the previous word was X, what are the odds the next word is Y?” To answer this, you need to build up a probability model using some tagged data. We grabbed the most common bigrams from Google’s ngram database and then, using the frequencies listed, constructed a probability model.
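
In code, building that model is just a matter of turning bigram counts into conditional probabilities. Here is a minimal sketch; the input format is an assumption, since the real Google ngram files need their own parsing:

```python
from collections import defaultdict

def build_model(bigram_counts):
    """Turn raw (prev, next) -> count data into P(next | prev)."""
    totals = defaultdict(int)
    for (prev, nxt), count in bigram_counts.items():
        totals[prev] += count
    return {(prev, nxt): count / totals[prev]
            for (prev, nxt), count in bigram_counts.items()}

# Illustrative counts only, standing in for the real ngram data
model = build_model({("vacation", "fallout"): 120, ("vacation", "fall"): 48})
```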

To better understand why we need the probabilities, let’s take a look at a concrete example. Take the following hashtag: #vacationfallout. There are two possible segmentations here: [“vacation”, “fallout”] or [“vacation”, “fall”, “out”]. So how do we know which to use? We examine the probability that the string “fallout” comes after “vacation”. This probability, as we know from our model, is higher than the probability of the words “fall” and “out” coming after “vacation”, so that’s the segmentation we go with.
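
Here is a rough sketch of that comparison. The probability values are made up for illustration; they are not numbers from the real model:

```python
# Toy conditional-probability table standing in for the real bigram model
BIGRAMS = {
    ("vacation", "fallout"): 1e-6,
    ("vacation", "fall"):    4e-7,
    ("fall", "out"):         1e-3,
}

def score(words):
    """Score a candidate segmentation by chaining its bigram probabilities."""
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= BIGRAMS.get((prev, word), 1e-9)  # small default for unseen pairs
    return prob

candidates = [["vacation", "fallout"], ["vacation", "fall", "out"]]
print(max(candidates, key=score))  # ['vacation', 'fallout']
```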

Now of course, since we’re dealing with probabilities, we might be wrong. Perhaps the author did intend for that hashtag to mean [“vacation”, “fall”, “out”]. But we learn to live with the fact that we’ll be wrong sometimes; the key is that we’ll be wrong far less often than we’re right.

Memoizing to increase performance

Since the Repustate API is hit pretty heavily, we still need to be concerned with performance. The first step we take is to determine whether there is a hashtag to expand. We do this with a simple regular expression. The next step, once we’ve determined a hashtag is present, is to expand it into its individual words. To make things go a bit faster, we memoize the functions we use so that when we encounter the same patterns (and we do), we don’t waste time calculating things from scratch each time. Here’s a memoizing decorator along the lines of the one we use in Python (the segment_hashtag function is a hypothetical stand-in for our real code):
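
```python
import functools

def memoize(func):
    """Cache results keyed by the arguments, so repeated inputs skip recomputation."""
    cache = {}
    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoize
def segment_hashtag(hashtag):
    # ...the expensive bigram-probability search happens here...
    return [hashtag]  # placeholder body
```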

Using Python’s AST parser to translate text queries to regular expressions

Python’s AST (Abstract Syntax Tree) module is pretty darn useful

We recently introduced a new set of API calls for our enterprise customers (all customers will soon have access to these APIs) that allow you to create customized rules for categorizing text. For example, let’s say you want to classify Twitter data into one of two categories: Photography or Photoshop. If it has to do with photography, that’s one category; if it has to do with Photoshop, that’s the other.

So we begin by listing out some boolean rules for how we want to match our text. We can use the OR operator, we can use AND, we can use negation by placing a “-” (dash or hyphen) before a token, and we can use parentheses to group pieces of logic together. Here are our definitions for the two categories:

Photography: photo OR camera OR picture
Photoshop: “Photoshop” -photo -shop

The first rule states that if a tweet has at least one of the words “photo”, “camera” or “picture”, then classify it as being in the Photography category. The second rule states that if it has the word “Photoshop” and does not contain the words “photo” and “shop” by themselves, then this piece of text falls under the Photoshop category. You’ll notice there’s an implicit AND operator wherever we use a white space to separate tokens.

Now one approach would be to take your text, tokenize it into a bag of words, and then go one by one through each of the boolean clauses and see which matches. But that’s way too slow. We can do this much faster using regular expressions.

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

The hilarity of the quote above notwithstanding, this problem is ready-made for regular expressions because we can compile our rules once at startup time and then just iterate over them for each piece of text. But how do you convert the category definitions above into a regular expression, using negative lookaheads/lookbehinds and all that other regexp goodness? We used Python’s AST module.

AST to the rescue

Thinking back to your days in CS, you’ll remember that an expression like 2 + 2 can be parsed and converted into a tree, where the binary operator ‘+’ is the parent and it has two child nodes, namely ‘2’ and ‘2’. If we take our boolean expressions and replace OR with ‘+’ and the AND operator (whitespace) with a ‘*’, we can feed them into Python’s ast module. A minimal sketch of that translation step might look like this (the to_python_expr helper is my own illustration, not our exact code):
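
```python
import ast

def to_python_expr(rule):
    """Rewrite a category rule as a Python expression:
    OR becomes '+' and the implicit whitespace AND becomes '*'."""
    expr = rule.replace(" OR ", "+")
    return "*".join(expr.split())  # every remaining space was an implicit AND

tree = ast.parse(to_python_expr('"Photoshop" -photo -shop'), mode="eval")
```

Conveniently, a quoted token like “Photoshop” parses as a string literal and a negated token parses as a unary minus, so the tree carries everything we need.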

The “process” method is what then traverses the tree and emits the necessary regular expression text. The original listing isn’t reproduced here, but a sketch along these lines (written against Python 3’s ast node types, where a quoted token parses as ast.Constant) produces the patterns shown below:
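
```python
import ast

def process(node):
    """Walk the AST recursively, emitting a regular expression fragment per node."""
    if isinstance(node, ast.Expression):
        return process(node.body)
    if isinstance(node, ast.BinOp):
        left, right = process(node.left), process(node.right)
        if isinstance(node.op, ast.Add):    # '+' stood in for OR: alternation
            return "%s|%s" % (left, right)
        if isinstance(node.op, ast.Mult):   # '*' stood in for AND: lookahead
            return "(?=.*%s).*%s" % (left, right)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        # '-' stood in for NOT: fail if the term appears anywhere
        return "^((?!%s).)*$" % process(node.operand)
    if isinstance(node, ast.Constant):      # quoted token: match as a whole word
        return r"\b%s\b" % node.value
    if isinstance(node, ast.Name):          # bare token: match as-is
        return node.id
    raise ValueError("unsupported syntax: %s" % ast.dump(node))
```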

(I’ve omitted some helper methods, but what you see above is the heart of the algorithm.) So the final regular expressions for the two rules above would look like this:

Photography: ‘photo|camera|picture’
Photoshop: ‘(?=.*(?=.*\bPhotoshop\b).*^((?!photo).)*$).*^((?!shop).)*$’

That second rule in particular is a doozy because it uses lookarounds, which are a pain in the butt to derive by hand.

The AST module emits a tree where each node has a type. So when traversing the tree, we just have to check which type of node we’re dealing with and proceed accordingly. If it’s a binary operator for example, such as the OR operation, we know we have to put a pipe (i.e. “|”) between the two operands to form an “OR” in regular expression syntax. Similarly, AND and NOT are processed accordingly and since it’s a tree, we can do this recursively. Neat.

(More documentation on the AST module can be found in the Python docs at https://docs.python.org/3/library/ast.html.)

The final product is a very fast regular expression which allows us to categorize text quickly and accurately. Next post, I’m going to talk about semantic categorization (e.g. Tag all pieces of text that have to do with football or baseball under the “Sports” category) so stay tuned!

Provisioning virtual appliances with Vagrant

Virtual Appliances

Repustate’s API is great for 95% of our customers – but for that other 5%, we needed something else. The reason is that the 5% are enterprise customers who cannot transmit their data (for both legal and security reasons) across the public internet, even with HTTPS. So we created the Repustate Server product, which is our API housed within a virtual appliance, distributed for either VMware or VirtualBox. Since it’s a virtual appliance, we’re really just shipping a self-contained OS with all the bits & pieces installed, configured and ready to use. No installation, no compilation; just download the Server instance, fire it up, and you’re off. Sounds great, right? So what’s the problem? Well, provisioning demos, assigning static IPs, and provisioning multiple instances can get tricky and tedious.

Demos

Many customers request demos of the Repustate Server. Our demos need to be configured with data that disables the virtual appliance after a certain date. (Since one of our selling points is that the Repustate Server requires no external internet access, all configuration must be included; no “phoning home” allowed.)

Furthermore, details about the customer need to be embedded into the virtual appliance. At first, we would clone a “base” virtual machine, edit the various files, package it up, upload it somewhere and then send a link. Yuck. It took way too long, and we had to keep making sure the virtual appliance was up to date with all of the recent changes to the Repustate code base (multiple code repositories, precompiled binaries, etc.).

Static IPs

Many customers could not just fire up a virtual appliance and rely on DHCP to assign it an IP, so there was a need for static IPs. A customer would specify the various network settings they needed, we would configure the virtual appliance, and then repeat the previous steps of packaging, uploading, and sending out a link to download. Yuck; even more manual, tedious fiddling around.

(Note: I know there’s a tool called VMware Network Editor, but apparently it is not included with VMware Player, only VMware Workstation, and some customers for whatever reason don’t have access to the Editor tool; hence the need for us to configure static IPs for them within the guest OS.)

Multiple Instances

Our larger customers use multiple Repustate Servers in parallel to form a cluster of machines for the purpose of increasing throughput. If you’re trying to consume and analyze a steady stream of text, you’re going to need some decent processing power to keep up, especially if you want things to be synchronous. To that end, our customers put a load balancer in front of several Repustate Server instances, and now they have their own text analytics cluster chugging away with little-to-no maintenance. Ah, but what happens when our customers want a new instance to throw into the cluster? They can’t just clone an existing one because of the static IP issue described above. So they have to ask us for a new instance. Yuck.

Vagrant to the rescue

Alright, this was a long setup for the payoff – Vagrant. Vagrant allows you to programmatically (i.e. from the shell, no GUI required) create new instances of a virtual machine, for either VirtualBox or VMware. You can completely configure the new instances using Chef or Puppet to download whatever code you need, compile any packages you want, etc. It completely automates the role of server admin – well, at least the server setup part. With Vagrant, all the customizations Repustate needs to perform for each customer are simply automated with Chef. For example, say we need to know when a demo expires. We have a Chef recipe that queries our database for the license expiration date, retrieves it, and then stores it in the new virtual appliance.
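
To give a flavour of what “programmatically” means here, a sketch of driving Vagrant from a backend script might look like the following. This is illustrative only, not our actual tooling; the customer-specific wiring lives in the Chef recipes:

```python
import os
import subprocess

def provision_appliance(customer_id, output_dir="builds"):
    """Sketch: bring up a Vagrant box, let Chef provision it, then package it."""
    env = dict(os.environ, CUSTOMER_ID=str(customer_id))  # read by the Chef recipes
    subprocess.check_call(["vagrant", "up"], env=env)
    subprocess.check_call(["vagrant", "package", "--output",
                           os.path.join(output_dir, "%s.box" % customer_id)])
    subprocess.check_call(["vagrant", "destroy", "--force"])
```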

Customers can now create their Repustate Server instances on their own using a simple form in the Repustate dashboard:

[Screenshot: the provisioning form in the Repustate dashboard]

This includes setting their own IP configuration for those who can’t use DHCP. Once the customer requests their new instance, Vagrant fires up (asynchronously, just like asking Amazon EC2 for a new instance) and begins provisioning it. Once finished, an email is fired off with a link to download the new virtual appliance. Eventually, we’ll add an API call for this so it can all be done programmatically without the need to log in to the dashboard.

A massive time saver for us at Repustate – big thanks to Mitchell for putting Vagrant together.

Chinese sentiment analysis now available

Chinese sentiment analysis is now part of the Repustate API

We are very proud to announce our new Chinese sentiment analysis engine. Built on the same engine we used to create our world-leading Arabic sentiment engine, the Chinese sentiment analysis engine is blazingly fast and accurate.

For the impatient ones, if you want to try it out, you can use our online demo here. If you’re interested in how our Chinese sentiment analysis works, read on!

Conditional Random Fields

Unlike English and other languages written in Latin script, Chinese (simplified) doesn’t necessarily delimit words with whitespace. For example, the following string of characters is a completely normal passage in Chinese:

团购分量比较一般,不过肉多,而且是和两个女生,所以基本都能吃饱。 猪手香肠无得讲,的确系一般餐厅做唔出的味道,其他就比较一般啦。 后来和朋友们正价去吃了一次,感觉分量比团购多,希望商家以后能一视同仁啦。

(For those who don’t read Chinese, this is a review of a restaurant.) Now you’ll see a few white spaces here and there, but there are actually many more words being expressed than there are separated tokens. So how do we know where one word (or idea) begins and the next ends?

We use a technique called conditional random fields (CRFs), which uses probabilistic models to infer the role of a particular glyph (character) given the glyphs around it. With a large enough pre-tagged corpus of Chinese text, Repustate can get close to 100% accuracy in identifying the individual words or ideas being expressed in a long chain of Chinese glyphs.
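
As a rough illustration of the idea, here is a sketch using the open-source python-crfsuite library (an assumption for illustration, not necessarily what Repustate runs in production). Each glyph gets tagged with its position in a word (B = begin, M = middle, E = end, S = single-glyph word), and the string is cut wherever a word ends:

```python
import pycrfsuite  # pip install python-crfsuite

def char_features(sentence, i):
    """Features for glyph i: the glyph itself plus its immediate neighbours."""
    feats = ["char=" + sentence[i]]
    if i > 0:
        feats.append("prev=" + sentence[i - 1])
    if i + 1 < len(sentence):
        feats.append("next=" + sentence[i + 1])
    return feats

def segment(sentence, tagger):
    """Cut a string of glyphs into words using B/M/E/S tags from a trained CRF."""
    tags = tagger.tag([char_features(sentence, i) for i in range(len(sentence))])
    words, current = [], ""
    for glyph, tag in zip(sentence, tags):
        current += glyph
        if tag in ("E", "S"):  # a word just ended
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

# A model trained on the pre-tagged corpus is assumed to exist already:
# tagger = pycrfsuite.Tagger(); tagger.open("zh-segmentation.model")
```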

Part of speech tagging & sentiment

Now that we know which words are being used, we can apply part of speech tagging (nouns, verbs, adjectives, etc.) to help construct a grammatical overview of a piece of text. This then allows us to perform sentiment analysis using our proprietary engine. It’s the same engine that powers our Arabic sentiment analysis. Sentiment analysis uses a combination of probabilistic models, a dictionary of terms and phrases that connote sentiment, and hand-tuned heuristics that are language specific. All of this is done in a split second, so you can analyze hundreds of Chinese documents in one HTTP request using the Repustate API.
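
A bulk call might look something like the sketch below. The endpoint, version and parameter names here are assumptions for illustration; check the API documentation for the real ones:

```python
import requests

API_KEY = "your-api-key"  # placeholder

# Assumed endpoint and field names -- illustrative only
resp = requests.post(
    "https://api.repustate.com/v2/%s/bulk-score.json" % API_KEY,
    data={"lang": "zh",
          "text1": "团购分量比较一般,不过肉多",
          "text2": "猪手香肠无得讲"},
)
print(resp.json())
```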

Try it out!

Create your free account and try out Repustate’s new Chinese sentiment engine today.