Introducing Semantic Analysis

Repustate is announcing today the release of its new product: semantic analysis. Combined with sentiment analysis, it gives any organization, from startup to Fortune 50 enterprise, the tools it needs to conduct in-depth text analytics. For the impatient, head on over to the semantic analysis docs page to get started.

Text analytics: the story so far

Until this point, Repustate has been concerned with analyzing text structurally. Part of speech tagging, grammatical analysis, even sentiment analysis is really all about the structure of the text: the order in which words appear, the use of conjunctions, adjectives or adverbs to denote sentiment. All of this is a great first step in understanding the content around you – but it’s just that, a first step.

Today we’re proud and excited to announce Semantic Analysis by Repustate. We consider this release to be the biggest product release in Repustate’s history and the one that we’re most proud of (although Arabic sentiment analysis was a doozy as well!).

Semantic analysis explained

Repustate can determine the subject matter of any piece of text. We know that a tweet saying “I love shooting hoops with my friends” has to do with sports, namely, basketball. Using Repustate’s semantic analysis API you can now determine the theme or subject matter of any tweet, comment or blog post.

But beyond just identifying the subject matter of a piece of text, Repustate can dig deeper and understand each and every key entity in the text and disambiguate based on context.

Named entity recognition

Repustate’s semantic analysis tool extracts each and every entity in a piece of text and tells you the context. Repustate knows that the term “Obama” refers to “Barack Obama”, the President of the United States. Repustate knows that in the sentence “I can’t wait to see the new Scorsese film”, Scorsese refers to “Martin Scorsese” the director. With very little context (and sometimes no context at all), Repustate knows exactly what an arbitrary piece of text is talking about. Take the following examples:

  1. Obama.
  2. Obama is the President.
  3. Obama is the First Lady.

Here we have three instances where the term “Obama” is being used in different contexts. In the first example, there is no context, just the name ‘Obama’. Repustate will use its internal probability model to determine that the most likely referent for this term is ‘Barack Obama’, so an API call will return ‘Barack Obama’. Similarly, in the second example, the word “President” acts as a hint for the term ‘Obama’ and again, the API call will return ‘Barack Obama’. But what about the third example?

Here, Repustate is smart enough to see the phrase “First Lady”. This tells Repustate to select ‘Michelle Obama’ instead of Barack. Pretty neat, huh?
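
To make the idea concrete, here’s a toy sketch in Python of how a prior probability plus context hints can resolve an ambiguous mention. This is not Repustate’s model; the candidate list, priors and hint words are invented purely for illustration:

import re

# Hypothetical candidate list: each entry has a prior probability plus
# context words that act as hints. All values here are made up.
CANDIDATES = {
    'Obama': [
        {'entity': 'Barack Obama',   'prior': 0.9, 'hints': {'president'}},
        {'entity': 'Michelle Obama', 'prior': 0.1, 'hints': {'first', 'lady'}},
    ],
}

def disambiguate(mention, text):
    """Pick the candidate whose prior plus context-hint score is highest."""
    words = set(re.findall(r'\w+', text.lower()))
    def score(candidate):
        return candidate['prior'] + sum(1 for hint in candidate['hints'] if hint in words)
    return max(CANDIDATES[mention], key=score)['entity']

print(disambiguate('Obama', 'Obama.'))                    # Barack Obama (the prior wins)
print(disambiguate('Obama', 'Obama is the President.'))   # Barack Obama
print(disambiguate('Obama', 'Obama is the First Lady.'))  # Michelle Obama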

Semantic analysis in your language

As with every other feature Repustate offers, no language takes a back seat, which is why semantic analysis will be available in every language Repustate supports. Currently only English is publicly available, but we’re rolling out the other languages in the coming weeks.

Semantic analysis by the numbers

Repustate currently has over 5.5 million entities, including people, places, brands, companies and ideas in its ontology. There are over 500 categorizations of entities, and over 30 themes with which to classify a piece of text’s subject matter. And an infinite number of ways to use Repustate to transform your text analysis.

Head on over to the Semantic Analysis Tour to see more.

 

Using python requests – whose encoding is it anyway?

Python requests encoding – using the Python requests module might give you surprising results

If you do anything remotely related to HTTP and use Python, then you’re probably using requests, the amazing library by Kenneth Reitz. It’s to Python HTTP programming what jQuery is to Javascript DOM manipulation – once you use it, you wonder how you ever did without it.

But there’s a subtle issue with regards to encodings that tripped us up. A customer told us that some Chinese web pages were coming back garbled when using the clean-html API call we provide. Here’s the URL:

http://finance.sina.com.cn/china/20140208/111618150293.shtml

In the HTML on these pages, the charset is gb2312, an encoding that came out of China for the Simplified Chinese character set. However, many web servers do not send this as the charset in the response headers (due to the programmers, not the web server itself). As a result, requests defaults to ISO-8859-1 as the encoding when the response doesn’t contain a charset. This is done in accordance with RFC 2616. The upshot is that the Chinese text in the web page doesn’t get decoded properly when you access the content of the response, and so what you see is garbled characters.

Here are the response headers for the above URL:

curl -I http://finance.sina.com.cn/china/20140208/111618150293.shtml
HTTP/1.1 200 OK
Content-Type: text/html
Vary: Accept-Encoding
X-Powered-By: schi_v1.02
Server: nginx
Date: Mon, 17 Feb 2014 15:54:28 GMT
Last-Modified: Sat, 08 Feb 2014 03:56:49 GMT
Expires: Mon, 17 Feb 2014 15:56:28 GMT
Cache-Control: max-age=120
Content-Length: 133944
X-Cache: HIT from 236-41.D07071951.sina.com.cn

There is a thread on the GitHub repository for requests that explains why they do this – requests shouldn’t be about HTML, the argument goes, it’s about HTTP, so if a server doesn’t respond with the proper charset declaration, it’s up to the client (or the developer) to figure out what to do. That’s a reasonable position to take, but it poses an interesting question: when common use or expectations go against the official spec, whose side do you take? Do you tell developers to put on their big boy and girl pants and deal with it, or do you acquiesce and just do what most people expect/want?

Specs be damned, make it easy for people

I believe it was Alex Payne, Twitter’s API lead at the time, who was asked why Twitter includes the version of the API in the URL rather than in the request header, as would be more RESTful. His paraphrased response (because I can’t find the quote) was that Twitter’s goal was to get as many people using the API as possible, and setting headers was beyond the skill level of many developers, whereas including the version in the URL is dead simple. (We at Repustate do the same thing; our APIs are versioned via the URL. It’s simpler and more transparent.)

Now the odd thing about requests is that the response object has an attribute called apparent_encoding which does correctly guess the charset based on the content of the response. It’s just not automatically applied, because the response header takes precedence.
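
If you don’t want to patch anything, a small client-side workaround does the job. Here’s a sketch that falls back to apparent_encoding whenever the server doesn’t declare a charset (this is not the patch we applied, just the same idea expressed as a wrapper):

import requests

def fetch_text(url):
    """Fetch a page, falling back to the guessed encoding if no charset is declared."""
    response = requests.get(url)
    content_type = response.headers.get('Content-Type', '')
    if 'charset' not in content_type.lower():
        # No declared charset: trust the content-based guess (e.g. GB2312 above)
        response.encoding = response.apparent_encoding
    return response.text

html = fetch_text('http://finance.sina.com.cn/china/20140208/111618150293.shtml')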

We ended up patching requests so that apparent_encoding is used whenever the response doesn’t declare a charset, but this is not the default behaviour of the package.

I can’t say I necessarily disagree with the choices the maintainers of requests have made. I’m not sure if there is a right answer because if you write your code to be user friendly in direct opposition to a published spec, you will almost certainly raise the ire of someone who *does* expect things to work to spec. Damned if you do, damned if you don’t.

Social sentiment analysis is missing something

Sentiment analysis by itself doesn’t suffice

An interesting article by Seth Grimes caught our eye this week. Seth is one of the few voices of reason in the world of text analytics that I feel “gets it”. His views on sentiment’s strengths and weaknesses, advantages and shortcomings align quite perfectly with Repustate’s general philosophy.

In the article, Seth states that simply relying on a number denoting sentiment, or a label like “positive” or “negative”, is too coarse a measurement and doesn’t carry any meaning with it. By doing so, you risk overlooking deeper insights that are hidden beneath the high level sentiment score. We couldn’t agree more, and that’s why Repustate supports categorizations.

Sentiment by itself is meaningless; sentiment analysis scoped to a particular business need or product feature is where the true value lies. Categorizing your social data by features of your service (e.g. price, selection, quality) first and THEN applying sentiment analysis is the way to go. In the article, Seth proceeds to list a few “emotional” categories (promoter/detractor, angry, happy, etc.) that, quite frankly, I would ignore. These categories are too touchy-feely, hard to disambiguate at a machine learning level, and don’t tie closely to actual business processes or features. For instance, if someone is a detractor, what is it that is causing them to be a detractor? Was it the service they received? If so, then customer service is the category you want, and the negative polarity of the text in question gives you invaluable insights. The fact that someone is being negative about your business means, almost by definition, that they are a detractor.

Repustate provides our customers with the ability to create their own categories according to the definitions that they create. Each customer is different, each business is different, hence the need for customized categories. Once you have your categories, sentiment analysis becomes much more insightful and valuable to your business.

Sharing large data structures across processes in Python

At Repustate, many of the data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, the dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact, for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how to keep things fast for the client and as light as possible on the server. We also distribute our software as virtual machines to some customers, so memory usage has to be light because we can’t control how much memory our customers will allocate to the VMs they deploy.

To summarize, here’s our checklist of requirements:

  1. Low memory footprint
  2. Can be shared amongst multiple processes with no issues (read only)
  3. Very fast access
  4. Easy to update (write) out of process

So our first attempt was to store the models on disk in MongoDB and to load them into memory as Python dictionaries. This worked and satisfied #3 and #4, but failed #1 and #2. This is how Repustate operated for a while, but memory usage kept growing and it became unsustainable. Python dictionaries are not memory efficient. And because the data wasn’t shared between processes, each Apache process needed its own copy, which was too expensive.

One night I was complaining about our dilemma and a friend of mine, who happens to be a great developer at Red Hat, said these three words: “memory mapped file”. Of course! In fact, Repustate already uses memory mapped files, but I had completely forgotten about this. So that solves half my problem – it meets requirement #2. But what format does the memory mapped file take? Thankfully computer science has already solved all the world’s problems, and the perfect data structure was already out there: tries.

Tries (pronounced “trees” for some reason, and not “try’s”), AKA radix trees AKA prefix trees, are a data structure that lends itself well to lookups with string keys. Wikipedia has a better write-up, but long story short, tries are great for the type of models Repustate uses.

I found the marisa-trie package, a Python wrapper around a C++ implementation of a MARISA trie. “MARISA” is an acronym for Matching Algorithm with Recursively Implemented StorAge. What’s great about MARISA tries is that the storage mechanism really shrinks how much memory you need. The author of the Python package claimed a 50-100X reduction in size – our experience is similar.

What’s great about the marisa trie package is that the underlying trie structure can be written to disk and then read in via a memory mapped object. With a memory mapped marisa trie, all of our requirements are now met. Our server’s memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python’s dictionary implementation.
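
If you want to try the same approach, the core of it fits in a few lines. Here’s a sketch using the marisa-trie package; the keys, values and the '<H' record format are just placeholders for whatever your models actually store:

import marisa_trie

# Build the trie out of process (requirement #4) and write it to disk.
keys = [u'basketball', u'baseball', u'hockey']
values = [(1,), (2,), (3,)]    # one unsigned short per key, per the '<H' format
trie = marisa_trie.RecordTrie('<H', zip(keys, values))
trie.save('models.marisa')

# In each web process, memory-map the same file (requirements #1, #2 and #3).
shared = marisa_trie.RecordTrie('<H')
shared.mmap('models.marisa')
print(shared[u'basketball'])   # [(1,)]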

Next time you’re in need of sharing large amounts of data, give memory mapped tries a chance.

 

Chinese POS Tagger (and other languages)

Need an Arabic part of speech tagger (AKA an Arabic POS Tagger)? How about German or Italian? You’re in luck – Repustate’s internal POS taggers have been opened up via our API to give our developers the ability to slice and dice multilingual text the way we do.

The documentation for the POS tagger API call outlines all you need to know to get started. With this new API call you get:

  • English POS tagger
  • German POS tagger
  • Chinese POS tagger
  • French POS tagger
  • Italian POS tagger
  • Spanish POS tagger
  • Arabic POS tagger
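
Calling any of them is a single request. Here’s a rough sketch in Python; the exact endpoint path and parameter names below are assumptions on my part, so check the documentation for the real ones:

import requests

API_KEY = 'your-api-key'
# Hypothetical endpoint shape -- the POS tagger docs have the actual URL.
url = 'https://api.repustate.com/v2/%s/postags.json' % API_KEY

response = requests.post(url, data={
    'text': 'Der schnelle braune Fuchs springt',   # works the same for ar, zh, fr, it, es...
    'lang': 'de',
})
print(response.json())   # unified tag set: nouns, verbs, adjectives, adverbs, ...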

Beyond this, we’ve unified the tag set you get from the various POS taggers so that you only have to write code once to handle all languages. The complete tag set includes nouns, adjectives, verbs, adverbs, punctuation marks, conjunctions and prepositions. Give it a shot and let us know what you think.

Twitter + GeoJSON + d3.js + Rob Ford = Fun with data visualizations

Using Twitter, GeoJSON (via GDAL) and d3 to visualize data about Rob Ford

(Note: Gists turned into links so as to avoid too many roundtrips to github.com which is slow sometimes)

As a citizen of Toronto, the past few weeks (and months) have been interesting to say the least. Our mayor, Rob Ford, has made headlines for his various lewd remarks, football follies, and drunken stupors. With an election coming up in 2014, I was curious as to how the rest of Toronto felt about our mayor.

The Task

Collect data from Twitter, making sure we only use tweets from people in Toronto. Plot those tweets on an electoral map of Toronto’s wards, colour coding the wards based on how they voted in the 2010 Mayoral Election.

The Tools

  1. Repustate’s API (for data mining Twitter and extracting sentiment, of course!)
  2. Shapefiles for the City of Toronto including boundaries for the wards
  3. GDAL
  4. d3.js
  5. jQuery

I’m impatient, just show me the final product …

Alright, here’s the final visualization of Toronto’s opinion of Rob Ford. Take a look and then come back here and find out more about how we accomplished this.

The Process

For those still with me, let’s go through the step-by-step of how this visualization was accomplished.

Step 1: Get the raw data from Twitter

This is a given – if we want to visualize data, first we need to get some! To do so, we used Repustate to create our rules and monitor Twitter for new data. You can monitor social media data using Repustate’s web interface, but this is a project for hackers, so let’s use the Social API to do this. Here’s the code (you first have to get an API key):

https://gist.github.com/mostrovsky/7762449

Alright, so while Repustate gets our data, we can get the geo visualization aspect of this started. We’ll check back in on the data – for now, let’s delve into the world of geographic information systems and analytics.

Step 2: Get the shapefile

What’s a shapefile? Good question! Paraphrasing Wikipedia, a shapefile is a file format that contains vector information describing the layout and/or location of geographic features, like cities, states, provinces, rivers, lakes, mountains, or even neighbourhoods. Thanks to a recent movement to open up data to the public, it’s not *too* hard to get your hands on shapefiles anymore. To get the shapefiles for the City of Toronto, we went to the City’s OpenData website, which has quite a bit of cool data to play with. The shapefile we need, the one that shows Toronto divided up into its electoral wards, is right here. Download it!

Step 3: Convert the shapefile to GeoJSON

So we have our shapefile, but it’s in a vector format that’s not exactly ready to be imported into a web page. The next step is to convert this file into a web-friendly format. To do this, we need to convert the data in the shapefile into a JSON-inspired format, called GeoJSON. GeoJSON looks just like normal JSON (it is normal JSON), but it has a spec that defines what a valid GeoJSON object must contain. Luckily, there’s some open source software that will do this for us: the Geospatial Data Abstraction Library or GDAL. GDAL has *tons* of cool stuff in it, take a look when you have time, but for now, we’re just going to use the ogr2ogr command that takes a shapefile in and spits out GeoJSON. Armed with GDAL and our shapefile, let’s convert the data into GeoJSON:

ogr2ogr \
  -f GeoJSON \
  -t_srs EPSG:4326 \
  toronto.json \
  tcl3_icitw.shp

This new file, toronto.json, is a GeoJSON file that contains the co-ordinates for drawing the polygons representing the 44 wards of Toronto. Let’s break down that command so you know what’s going on here:

  1. ogr2ogr is the name of the utility we’re using; it comes with GDAL
  2. -f specifies the format of the final output
  3. -t_srs is *very* key: it makes sure the output is projected/translated into the familiar lat/long co-ordinate system. If you omit this flag (in fact, try omitting it), you’ll see the output requires extra work on the front end, which is unnecessary.
  4. toronto.json is the name of the output file
  5. tcl3_icitw.shp is the input shapefile

Alright, GeoJSON file done.

Step 4: Download the data

Assuming enough time has passed, Repustate will have found some data for us and we can retrieve it via the Social API:

https://gist.github.com/mostrovsky/7762900

The data comes down as a CSV. It’s left as an exercise for the reader to store the data in a format that you can query and retrieve later, but let’s just assume we’ve stored our data. Included in this data dump will be the latitude and longitude of each tweet(er). How do we know if this tweet(er) is in Toronto? There are a few ways, actually.

We could use PostGIS, an unbelievably amazing extension for PostgreSQL that allows you to do all sorts of interesting queries with geographic data. If we went this route, we could load the original shapefile into a PostGIS-enabled database, then for each tweet, use the ST_Contains function to check which tweet(er)s are in Toronto. Another way is to iterate over your dataset and, for each point, do a quick point-in-polygon calculation in the language of your choice. The polygon data can be retrieved from the GeoJSON file we created. Either way you go, it’s not too hard to determine which data originated from Toronto, given the lat/long co-ordinates.
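
For the second route, here’s a sketch of the point-in-polygon check in Python using the shapely library (not in the tool list above, but handy) against the toronto.json file we just generated:

import json
from shapely.geometry import shape, Point

# Load the ward polygons we produced with ogr2ogr.
with open('toronto.json') as f:
    wards = [shape(feature['geometry']) for feature in json.load(f)['features']]

def in_toronto(lat, lng):
    """True if the coordinates fall inside any of the 44 wards."""
    point = Point(lng, lat)    # note: GeoJSON order is (longitude, latitude)
    return any(ward.contains(point) for ward in wards)

print(in_toronto(43.788458853157117, -79.282781549043008))   # True -- it's in Scarborough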

Step 5: Visualize the data

Let’s see how this data looks. To plot the GeoJSON data and each of the data points from our Twitter data source, we use d3.js. For those not familiar with d3.js (Data-Driven Documents), it’s a JavaScript library that helps you bind data to graphical representations in pure HTML/SVG. It’s a great library to use if you want to visualize any data really, and it makes creating unique and customized maps pretty easy. And best of all, since the output is SVG, you’re pretty much guaranteed that your visuals will look identical on all browsers (well, IE 8 and down don’t support SVG) and mobile devices. Let’s dive into d3.js. We need to plot 2 things:

  1. The map of Toronto, with dividers for each of the wards
  2. The individual tweets, colour coded for sentiment and placed in the correct lat/long location.

The data for the map comes from the GeoJSON file we created earlier. We’re going to tell d3 to load it and for each ward, we’re going to draw a polygon (the “path” node in SVG) by following the path co-ordinates in the GeoJSON file. Here’s the code:

https://gist.github.com/mostrovsky/7776778

Path elements have an attribute, “d”, which describes how the path should be drawn. You can read more about that here. So for each ward, we create a path node and set the “d” attribute to contain the lat/long co-ordinates that, when connected, form the boundaries of the ward. You’ll see we also have custom fill colouring based on the 2010 Mayoral election – that’s just an extra touch we added. You can play around with the thickness of the bordering lines. We’re also adding a mouseover event to show a little tooltip about the ward in question when the user hovers over it.

Time to add the data points. We’re going to separate them by date, so we can add a slider later to show how the data changes over time. We’re also including the sentiment information that Repustate automatically calculates for each piece of social data. Here’s an example of how one complete data point would look:

{
 "lat": 43.788458853157117, 
 "mm": 10,
 "lng": -79.282781549043008,
 "dd": 28,
 "sentiment": "pos"
}

Now to plot these on our map, we again load a JSON file and for each point, we bind a “circle” object, positioned at the correct lat/long co-ordinates. Here’s the code:

https://gist.github.com/mostrovsky/7776799

At this point, if you tried to render the data on screen, you wouldn’t see a thing, just a blank white SVG. That’s because you need to deal with the concept of map projections. If you think about the map of the world, there’s no “true” way to render it on a flat, 2D surface, like your laptop screen or iPad. This has given rise to the concept of projections: the process of mapping points on a curved surface onto a flat plane. There are many ways to do this, and you can see a list of available projections here; for our example, we used the Albers projection. The last thing we have to do is scale and transform our SVG and we’re done. Here’s the code for projecting, scaling and transforming our map:

https://gist.github.com/mostrovsky/7776853

Those values are not random – they came about from some trial & error. I’m not sure how to do this properly or in an algorithmic way.

Please take a look at the finished product and of course, if you or your company is looking into doing this sort of thing with your own data set, contact us!

Building a better TripAdvisor

(This is a guest post by Sarah Harmon. Sarah is currently a PhD student at UC Santa Cruz studying artificial intelligence and natural language processing. Her website is NeuroGirl.com)

Click here to see a demo of Sarah’s work in action.

Using Repustate’s API to get the most out of TripAdvisor customer reviews

I’m a big traveler, so I often check online ratings, such as those on TripAdvisor, to decide which local hotel or restaurant is worth my time.  In this post, I use Repustate’s text categorization API to analyze the sentiment of hotel reviews, and – in so doing – work towards a better online hotel rating system.

Five-star scales aren’t good enough

The five-star rating scale is a standard ranking system for many products and services.  It’s a fast way for consumers to check out the quality of something before paying the price, but it’s not always accurate.  Five-star ratings are inherently subjective, don’t let raters say all they want to say, and force a generic labelling for a complex experience.

Asking people to write text reviews solves a few of these problems.  Instead of relying on the five-star scale, we ask people to highlight the most memorable parts of their experience.  Still, who has the time to read hundreds of reviews to get a true sense of what a hotel is like?  Even the small sample of reviews shown on the first page could be unhelpful, unreliable or cleverly submitted by the hotel itself to make itself look good. What we need is a way to summarize the review content – and ideally, we’d like a summary that’s specific to our own values.

Making a personalized hotel ranking system

Let’s take a look at a website that uses star ratings all the time: TripAdvisor, a travel resource that features an incredible wealth of user-reviewed hotels.

trip advisor front page

TripAdvisor uses AJAX-based pagination, which means that if we’d like to access that wealth of data, normal methods of screen-scraping won’t work.  I decided to use CasperJS, a navigation and scripting utility built on top of PhantomJS, to easily scrape the TripAdvisor website with JavaScript.

Here’s a simplified example of how you can use CasperJS to view the star ratings for every listed hotel from New York:

In a similar fashion, I used Python and CasperJS to retrieve a sample of 100 hotels each from five major locations – Italy, Spain, Thailand, the US, and the UK – and retrieved the top 400 English reviews from each hotel as listed on TripAdvisor.  To ensure that each stored review was in English, I relied on Python’s Natural Language Toolkit. Finally, I called on the Repustate API to analyze each review using six hotel-specific categories: food, price, location, accommodations, amenities, and staff.

sample of hotel rater

Check out the results in this “HotelRater” demo.  Select a location and a niche you care about, and you’ll see hotels organized in order of highest sentiment for that category.  To generate those results, the sentiment scores for each hotel across each category were averaged, and then placed on a scale from 1 to 100.  (I chose to take a mean of the sentiment values because it’s a value that’s easy to calculate and understand.)  The TripAdvisor five-star rating is shown for comparison.  You can also click on the hotels listed to see how Repustate categorized each of their reviews.
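
The arithmetic behind those numbers is simple. Here’s a sketch of the calculation, assuming the raw sentiment values sit in the -1 to 1 range (an assumption for illustration):

def category_score(sentiment_scores):
    """Average a category's review scores (assumed to be in [-1, 1]) onto a 1-100 scale."""
    mean = sum(sentiment_scores) / float(len(sentiment_scores))
    return round(1 + (mean + 1) / 2.0 * 99, 1)

# e.g. the 'food' scores from one hotel's reviews
print(category_score([0.8, 0.3, -0.2, 0.6]))   # 69.1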

When I started putting the app I built into practice, I could suddenly make sense out of TripAdvisor’s abundance of data.  A hotel might have only a four-star rating on average, yet its customers were generally very happy with key aspects such as its food, staff, and location.  Desirable hotels popped up in my search results that I might never have seen or considered because of their lower average star rating on TripAdvisor.  The listed sentiment scores also helped to differentiate hotels which I would previously have had trouble sorting through because they all shared the same 4.5-star rating.

This demo isn’t a replacement for TripAdvisor by any means, since there’s hardly enough stored data or options included to assist you in your quest for the perfect hotel on your vacation.  That said, it’s a positive step towards a new ranking system that’s aligned with our individual values.  We can quickly see how people are really feeling, without condensing their more specific thoughts into a blanket statement.

I’d give that five stars any day.

Sentiment Analysis Accuracy – Goal or Distraction?

Everything you need to know about sentiment analysis accuracy

When potential customers approach us, one of the first questions we’re asked is “How accurate is your sentiment analysis engine?” Well, as any good MBA graduate will tell you, the correct answer is “It depends.” That might sound like a cop-out and a way to avoid answering the question, but in fact, it’s the most accurate response one can give.

Let’s see why. Take this sentence:

“It is sunny today.”

Positive or negative? Most would perhaps say positive – Repustate would score this as neutral, since no opinion or intent is being stated, merely a fact. For those who would argue that this sentence is positive, let’s tweak it:

“It is sunny today, my crops are getting dried out and will cause me to go bankrupt.”

Well, that changes things, doesn’t it? Put into a greater context, the polarity (positive or negative sentiment) of a phrase can change dramatically. That’s why it is difficult to achieve 100% accuracy with sentiment.

Let’s take a look at another example. From a review of a horror movie:

“It will make you quake in fear.”

Positive or negative? Well, for a horror movie, this is a positive because horror movies are supposed to be scary. But if we were talking about watching a graphic video about torture, then the sentiment is negative. Again, context matters here more than the actual words themselves.

What about when we introduce conjunctions into the mix:

“I thought the movie was great, but the popcorn was stale and too salty.”

The first part of the sentence is positive, but the second part is negative. So what is the sentiment of this sentence? At Repustate, we would say “It depends.” The sentiment for the movie is positive; the sentiment surrounding the popcorn is negative. The question is: which topic are you interested in analyzing?

There is no one TRUE sentiment engine

As the previous examples have demonstrated, sentiment analysis is tricky and highly contextual. As a result, any benchmarks that companies post have to be taken with a pinch of salt. Repustate’s approach is the following:

  1. Make our global model as flexible as possible to catch as many cases as possible without being too inconsistent
  2. Allow customers the ability to load their own custom, domain-specific, definitions via our API (e.g. Make the phrase “quake in fear” positive for movie reviews)
  3. Allow sentiment to be scored in the context of only one topic, again, via our API

When shopping for sentiment analysis solutions, ask potential vendors about these points. Make sure whichever solution you end up going with can be tailored to your specific domain, because chances are your data has unique attributes and characteristics that might be overlooked or incorrectly accounted for by a one-size-fits-all sentiment engine.

Or you can save yourself a lot of time and use Repustate’s flexible sentiment analysis engine.

Segmenting Twitter hashtags

Segmenting Twitter hashtags to gain insight

Twitter hashtags are everywhere these days and the ability to data mine them is an important one. The problem with hashtags is that they are one long string composed of a few smaller words. If we can segment the long hashtag into its individual words, we can gain some extra insight into the context of the tweet and maybe determine the sentiment as a result.

So what to do? How do we solve the problem of the long, single string?

Use the probabilities, Luke

As we did with Chinese sentiment, we had to rely on conditional probabilities to determine the most likely words in a string of characters. Put differently, you’re trying to answer the question: “If the previous word was X, what are the odds the next word is Y?” To answer this, you need to build up a probability model using some tagged data. We grabbed the most common bigrams from Google’s ngram database and then, using the frequencies listed, constructed a probability model.

To better understand why we needed the probabilities, let’s take a look at a concrete example. Take the following hashtag: #vacationfallout. There are two possible segmentations here, [“vacation”, “fallout”] or [“vacation”, “fall”, “out”]. So how do we know which to use? We examine the probability that the string “fallout” comes after “vacation”. This probability, as we know from our model, is higher than the probability of the words “fall” and “out” coming after “vacation”, so that’s the segmentation we go with.

Now of course, since we’re dealing with probabilities, we might be wrong. Perhaps the author did intend for that hashtag to mean [“vacation”, “fall”, “out”]. But we learn to live with the fact that we’ll be wrong sometimes; the key is that we’ll be wrong far less often than we’re right.
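
Here’s a stripped-down sketch of that segmentation step. The real model is built from Google’s bigram counts; the vocabulary and probabilities below are invented just to keep the example small:

# Toy bigram model: P(word | previous word). These numbers are made up.
BIGRAM_PROB = {
    ('vacation', 'fallout'): 0.004,
    ('vacation', 'fall'):    0.001,
    ('fall', 'out'):         0.010,
}
VOCAB = {'vacation', 'fallout', 'fall', 'out'}
UNSEEN = 1e-6    # crude smoothing for pairs the model has never seen

def segment(text, prev='^'):
    """Return (probability, words) for the most likely segmentation of `text`."""
    if not text:
        return 1.0, ()
    best_p, best_words = 0.0, None
    for i in range(1, len(text) + 1):
        word = text[:i]
        if word not in VOCAB:
            continue
        p_rest, rest = segment(text[i:], word)   # memoizing this call pays off (see below)
        if rest is None:
            continue
        p = BIGRAM_PROB.get((prev, word), UNSEEN) * p_rest
        if p > best_p:
            best_p, best_words = p, (word,) + rest
    return best_p, best_words

print(segment('vacationfallout')[1])   # ('vacation', 'fallout')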

Memoizing to increase performance

Since the Repustate API is hit pretty heavily, we still need to be concerned with performance. The first step we take is to determine if there is a hashtag to expand. We do this using a simple regular expression. The next step, once we’ve determined there is a hashtag present, is to expand it into its individual words. To make things go a bit faster, we memoize the functions we use so that when we encounter the same patterns (and we do), we won’t waste time calculating things from scratch each time. Here’s a minimal memoizing decorator in Python (on Python 3, functools.lru_cache does the same job out of the box):
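
import functools

def memoize(func):
    """Cache results keyed on the function's positional arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

Decorating the hashtag-segmentation function above with @memoize means repeated hashtags, and repeated substrings within them, only ever get computed once.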

Using Python’s AST parser to translate text queries to regular expressions

Python’s AST (Abstract Syntax Tree) module is pretty darn useful

We recently introduced a new set of API calls for our enterprise customers (all customers will soon have access to these APIs) that allow you to create customized rules for categorizing text. For example, let’s say you want to classify Twitter data into one of two categories: Photography or Photoshop. If it has to do with photography, that’s one category; if it has to do with Photoshop, that’s another category.

So we begin by listing out some boolean rules as to how we want to match our text. We can use the OR operator, we can use AND, we can use negation by placing a “-” (dash or hyphen) before a token and we can use parentheses to group pieces of logic together. Here are our definitions for the two categories:

Photography: photo OR camera OR picture
Photoshop: "Photoshop" -photo -shop

The first rule states that if a tweet has at least one of the words “photo”, “camera” or “picture”, then classify it as being in the Photography category. The second rule states that if it has the word “Photoshop” and does not contain the words “photo” and “shop” by themselves, then this piece of text is under the Photoshop category. You’ll notice there’s an implicit AND operator wherever we use whitespace to separate tokens.

Now one approach would be to take your text, tokenize it into a bag of words, and then go one by one through each of the boolean clauses and see which matches. But that’s way too slow. We can do this much faster using regular expressions.

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

The hilarity of the quote above notwithstanding, this problem is ready-made for regular expressions because we can compile our rules once at startup time and then just iterate over them for each piece of text. But how do you convert the category definitions above into a regular expression, using negative lookaheads/behinds and all that other regexp goodness? We used Python’s AST module.

AST to the rescue

Thinking back to your days in CS, you’ll remember that an expression like 2 + 2 can be parsed and converted into a tree, where the binary operator ‘+’ is the parent and it has two child nodes, namely ‘2’ and ‘2’. If we take our boolean expressions and replace OR with ‘+’ and the AND operator (whitespace) with ‘*’, we can feed the result into Python’s ast module and get back a syntax tree.

A “process” method then traverses the tree and emits the necessary regular expression text.
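
Here’s a simplified, self-contained sketch of both steps: the expression rewrite plus a minimal process-style traversal. It isn’t Repustate’s production code (the function names are mine, it assumes plain space-separated word tokens, and it emits a flatter, lookahead-only pattern, so its output differs a bit from the exact regexes shown below), but it shows the idea end to end:

import ast
import re

def rule_to_python_expr(rule):
    # Map the rule syntax onto Python operators so ast.parse can build the tree:
    # OR -> '+', implicit AND (whitespace) -> '*', '-token' is already a unary minus.
    out = []
    for token in rule.replace('"', '').split():
        if token == 'OR':
            out.append('+')
        else:
            if out and out[-1] not in ('+', '*'):
                out.append('*')    # implicit AND between adjacent operands
            out.append(token)
    return ' '.join(out)

def to_regex(node):
    if isinstance(node, ast.Expression):
        return to_regex(node.body)
    if isinstance(node, ast.Name):                                       # bare token
        return r'(?=.*\b%s\b)' % node.id
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):  # negation
        return r'(?!.*\b%s\b)' % node.operand.id
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):     # OR
        return '(?:%s|%s)' % (to_regex(node.left), to_regex(node.right))
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mult):    # AND
        return to_regex(node.left) + to_regex(node.right)
    raise ValueError('unsupported rule syntax: %r' % node)

def compile_rule(rule):
    tree = ast.parse(rule_to_python_expr(rule), mode='eval')
    return re.compile(to_regex(tree), re.IGNORECASE)

photoshop = compile_rule('"Photoshop" -photo -shop')
print(bool(photoshop.match('Just bought Photoshop CS6')))     # True
print(bool(photoshop.match('photo shop around the corner')))  # False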

That’s the heart of the algorithm. So the final regular expression for the two rules above would look like this:

Photography: 'photo|camera|picture'
Photoshop: '(?=.*(?=.*\\bPhotoshop\\b).*^((?!photo).)*$).*^((?!shop).)*$'

That second rule in particular is a doozy because it’s using lookarounds which are a pain in the butt to try to manually derive.

The AST module emits a tree where each node has a type. So when traversing the tree, we just have to check which type of node we’re dealing with and proceed accordingly. If it’s a binary operator for example, such as the OR operation, we know we have to put a pipe (i.e. “|”) between the two operands to form an “OR” in regular expression syntax. Similarly, AND and NOT are processed accordingly and since it’s a tree, we can do this recursively. Neat.

(More documentation on the AST module can be found here.)

The final product is a very fast regular expression which allows us to categorize text quickly and accurately. Next post, I’m going to talk about semantic categorization (e.g. Tag all pieces of text that have to do with football or baseball under the “Sports” category) so stay tuned!