Using python requests – whose encoding is it anyway?

Python requests encoding – using the Python requests module might give you surprising results

If you do anything remotely related to HTTP and use Python, then you’re probably using requests, the amazing library by Kenneth Reitz. It’s to Python HTTP programming what jQuery is to Javascript DOM manipulation – once you use it, you wonder how you ever did without it.

But there’s a subtle issue with regards to encodings that tripped us up. A customer told us that some Chinese web pages were coming back garbled when using the clean-html API call we provide. Here’s the URL:

http://finance.sina.com.cn/china/20140208/111618150293.shtml

In the HTML on these pages, the charset is gb2312, an encoding developed in China for Simplified Chinese characters. However, many web servers do not send this as the charset in the response headers (due to the programmers, not the web server itself). As a result, requests defaults to ISO-8859-1 as the encoding when the response doesn't declare a charset. This is done in accordance with RFC 2616. The upshot is that the Chinese text in the web page doesn't get decoded properly when you access the content of the response, and so what you see is garbled characters.

Here are the response headers for the above URL:

curl -I http://finance.sina.com.cn/china/20140208/111618150293.shtml
HTTP/1.1 200 OK

Content-Type: text/html
Vary: Accept-Encoding
X-Powered-By: schi_v1.02
Server: nginx
Date: Mon, 17 Feb 2014 15:54:28 GMT
Last-Modified: Sat, 08 Feb 2014 03:56:49 GMT
Expires: Mon, 17 Feb 2014 15:56:28 GMT
Cache-Control: max-age=120
Content-Length: 133944
X-Cache: HIT from 236-41.D07071951.sina.com.cn

There is a thread on the GitHub repository for requests that explains why they do this – requests shouldn't be about HTML, the argument goes, it's about HTTP, so if a server doesn't respond with a proper charset declaration, it's up to the client (or the developer) to figure out what to do. That's a reasonable position to take, but it poses an interesting question: when "common" usage or expectations go against the official spec, whose side do you take? Do you tell developers to put on their big boy and girl pants and deal with it, or do you acquiesce and just do what most people expect and want?

Specs be damned, make it easy for people

I believe it was Alex Payne, Twitter's API lead at the time, who was asked why Twitter includes the version of the API in the URL rather than in a request header, as would be more RESTful. His paraphrased response (because I can't find the quote) was that Twitter's goal was to get as many people using the API as possible, and setting headers was beyond the skill level of many developers, whereas including the version in the URL is dead simple. (We at Repustate do the same thing; our APIs are versioned via the URL. It's simpler and more transparent.)

Now the odd thing about requests is that the response object has an attribute called apparent_encoding which does correctly guess the charset based on the content of the response. It's just not applied automatically, because the response header takes precedence.

We ended up patching requests so that apparent_encoding is what gets used when no charset is set in the headers, but this is not the default behaviour of the package.
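If you'd rather not patch the library, the same fix can be applied per response. Here's a minimal sketch (the content-type check is just one reasonable heuristic):

import requests

url = 'http://finance.sina.com.cn/china/20140208/111618150293.shtml'
response = requests.get(url)

# If the server didn't declare a charset, requests falls back to ISO-8859-1
# per RFC 2616. Swap in requests' own guess based on the content instead.
if 'charset' not in response.headers.get('content-type', '').lower():
    response.encoding = response.apparent_encoding

html = response.text  # now decoded as gb2312/GBK rather than Latin-1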

I can’t say I necessarily disagree with the choices the maintainers of requests have made. I’m not sure if there is a right answer because if you write your code to be user friendly in direct opposition to a published spec, you will almost certainly raise the ire of someone who *does* expect things to work to spec. Damned if you do, damned if you don’t.

Social sentiment analysis is missing something

Sentiment analysis by itself doesn’t suffice

An interesting article by Seth Grimes caught our eye this week. Seth is one of the few voices of reason in the world of text analytics who, I feel, really "gets it". His views on sentiment analysis's strengths and shortcomings align closely with Repustate's general philosophy.

In the article, Seth states that simply relying on a number denoting sentiment, or a label like "positive" or "negative", is too coarse a measurement and doesn't carry much meaning on its own. By stopping there, you risk overlooking deeper insights that are hidden beneath the high-level sentiment score. We couldn't agree more, and that's why Repustate supports categorization.

Sentiment by itself is meaningless; sentiment analysis scoped to a particular business need or product feature is where the true value lies. Categorizing your social data by features of your service (e.g. price, selection, quality) first and THEN applying sentiment analysis is the way to go. In the article, Seth proceeds to list a few "emotional" categories (promoter/detractor, angry, happy, etc.) that, quite frankly, I would ignore. These categories are too touchy-feely, hard to disambiguate at a machine learning level, and don't tie closely to actual business processes or features. For instance, if someone is a detractor, what is it that is causing them to be a detractor? Was it the service they received? If so, then customer service is the category you want, and the negative polarity of the text in question gives you invaluable insight. The fact that someone is being negative about your business means, almost by definition, that they are a detractor.

Repustate provides our customers with the ability to create their own categories according to the definitions that they create. Each customer is different, each business is different, hence the need for customized categories. Once you have your categories, sentiment analysis becomes much more insightful and valuable to your business.

Sharing large data structures across processes in Python

At Repustate, many of the data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, our dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact, for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how to keep things fast for the client as well as light as possible for the server. We also distribute our software as virtual machines to some customers, so memory usage has to be light because we can't control how much memory our customers will allocate to the VMs they deploy.

To summarize, here’s our checklist of requirements:

  1. Low memory footprint
  2. Can be shared amongst multiple processes with no issues (read only)
  3. Very fast access
  4. Easy to update (write) out of process

So our first attempt was to store the models on disk in MongoDB and load them into memory as Python dictionaries. This worked and satisfied #3 and #4, but failed #1 and #2. This is how Repustate operated for a while, but memory usage kept growing and it became unsustainable. Python dictionaries are not memory efficient, and it was too expensive for each Apache process to keep its own copy since we were not sharing the data between processes.

One night I was complaining about our dilemma and a friend of mine, who happens to be a great developer at Red Hat, said these three words: "memory mapped file". Of course! In fact, Repustate already uses memory mapped files, but I had completely forgotten about them. So that solves half my problem: it meets requirement #2. But what format does the memory mapped file take? Thankfully computer science has already solved all the world's problems, and the perfect data structure was already out there: tries.

Tries (pronounced "trees", after "retrieval", and not "try's"), AKA radix trees AKA prefix trees, are a data structure that lends itself to objects that need string keys. Wikipedia has a better write-up, but long story short, tries are great for the type of models Repustate uses.

I found this package, marisa-trie, which is a Python wrapper around a C++ implementation of a MARISA trie. "MARISA" is an acronym for Matching Algorithm with Recursively Implemented StorAge. What's great about MARISA tries is that the storage mechanism really shrinks how much memory you need. The author of the Python package claims a 50-100X reduction in size; our experience is similar.

What’s great about the marisa trie package is that the underlying trie structure can be written to disk and then read in via a memory mapped object. With a memory mapped marisa trie, all of our requirements are now met. Our server’s memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python’s dictionary implementation.
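To give a feel for the workflow, here's a rough sketch of building a trie offline and then memory mapping it in each worker process, using the marisa-trie package's RecordTrie and its save/mmap methods (the keys and the single-float value format here are made up for illustration):

import marisa_trie

# Offline build step: pack each value as one float (struct format '<f')
items = [(u'good', (0.9,)), (u'bad', (-0.75,)), (u'meh', (0.1,))]
trie = marisa_trie.RecordTrie('<f', items)
trie.save('sentiment.marisa')

# In each web process: mmap the file instead of loading it onto the heap,
# so every process shares the same read-only pages.
model = marisa_trie.RecordTrie('<f')
model.mmap('sentiment.marisa')

print(model[u'good'])        # [(0.899999...,)] - lookups stay dictionary-fast
print(u'terrible' in model)  # False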

Next time you’re in need of sharing large amounts of data, give memory mapped tries a chance.

 

Chinese POS Tagger (and other languages)

Need an Arabic part-of-speech tagger (AKA an Arabic POS tagger)? How about German or Italian? You're in luck: Repustate's internal POS taggers have been opened up via our API, giving developers the ability to slice and dice multilingual text the way we do.

The documentation for the POS tagger API call outlines all you need to know to get started. With this new API call you get:

  • English POS tagger
  • German POS tagger
  • Chinese POS tagger
  • French POS tagger
  • Italian POS tagger
  • Spanish POS tagger
  • Arabic POS tagger

Beyond this, we’ve unified the tag set you get from the various POS taggers so that you only have to write code once to handle all languages. The complete tag set includes nouns, adjectives, verbs, adverbs, punctuation marks, conjunctions and prepositions. Give it a shot and let us know what you think.
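As a quick illustration, a call from Python might look something like the snippet below. The endpoint path and field names here are hypothetical, written for illustration only; the linked documentation is the authority on the actual URL and parameters.

import requests

API_KEY = 'your-api-key'

# Hypothetical endpoint and parameter names; check the POS tagger API docs
# for the real ones.
response = requests.post(
    'https://api.repustate.com/v2/%s/pos-tag.json' % API_KEY,
    data={'text': 'Das ist ein wirklich gutes Auto', 'lang': 'de'},
)
print(response.json())  # tokens paired with tags from the unified tag set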

Twitter + GeoJSON + d3.js + Rob Ford = Fun with data visualizations

Using Twitter, GeoJSON (via GDAL) and d3 to visualize data about Rob Ford

(Note: Gists turned into links so as to avoid too many roundtrips to github.com which is slow sometimes)

As a citizen of Toronto, the past few weeks (and months) have been interesting to say the least. Our mayor, Rob Ford, has made headlines for his various lewd remarks, football follies, and drunken stupors. With an election coming up in 2014, I was curious as to how the rest of Toronto felt about our mayor.

The Task

Collect data from Twitter, making sure we only use tweets from people in Toronto. Plot those tweets on an electoral map of Toronto’s wards, colour coding the wards based on how they voted in the 2010 Mayoral Election.

The Tools

  1. Repustate’s API (for data mining Twitter and extracting sentiment, of course!)
  2. Shapefiles for the City of Toronto including boundaries for the wards
  3. GDAL
  4. d3.js
  5. jQuery

I’m impatient, just show me the final product …

Alright, here’s the final visualization of Toronto’s opinion of Rob Ford. Take a look and then come back here and find out more about how we accomplished this.

The Process

For those still with me, let’s go through the step-by-step of how this visualization was accomplished.

Step 1: Get the raw data from Twitter

This is a given – if we want to visualize data, first we need to get some! To do so, we used Repustate to create our rules and monitor Twitter for new data. You can monitor social media data using Repustate’s web interface, but this is a project for hackers, so let’s use the Social API to do this. Here’s the code (you first have to get an API key):

https://gist.github.com/mostrovsky/7762449

Alright, so while Repustate gets our data, we can get the geo visualization aspect of this started. We’ll check back in on the data – for now, let’s delve into the world of geographic information systems and analytics.

Step 2: Get the shapefile

What's a shapefile? Good question! Paraphrasing Wikipedia, a shapefile is a file format that contains vector information describing the layout and/or location of geographic features, like cities, states, provinces, rivers, lakes, mountains, or even neighbourhoods. Thanks to a recent movement to open up data to the public, it's not *too* hard to get your hands on shapefiles anymore. To get the shapefiles for the City of Toronto, we went to the City's OpenData website, which has quite a bit of cool data to play with. The shapefile we need, the one that shows Toronto divided up into its electoral wards, is right here. Download it!

Step 3: Convert the shapefile to GeoJSON

So we have our shapefile, but it’s in a vector format that’s not exactly ready to be imported into a web page. The next step is to convert this file into a web-friendly format. To do this, we need to convert the data in the shapefile into a JSON-inspired format, called GeoJSON. GeoJSON looks just like normal JSON (it is normal JSON), but it has a spec that defines what a valid GeoJSON object must contain. Luckily, there’s some open source software that will do this for us: the Geospatial Data Abstraction Library or GDAL. GDAL has *tons* of cool stuff in it, take a look when you have time, but for now, we’re just going to use the ogr2ogr command that takes a shapefile in and spits out GeoJSON. Armed with GDAL and our shapefile, let’s convert the data into GeoJSON:

ogr2ogr \
  -f GeoJSON \
  -t_srs EPSG:4326 \
  toronto.json \
  tcl3_icitw.shp

This new file, toronto.json, is a GeoJSON file that contains the co-ordinates for drawing the polygons representing the 44 wards of Toronto. Let's break down that command so you know what's going on here:

  1. ogr2ogr is the name of the utility we're using; it comes with GDAL
  2. The -f option specifies the format of the final output
  3. The -t_srs flag is the key step: it makes sure the output is reprojected into the familiar lat/long co-ordinate system (EPSG:4326). If you omit it (in fact, try omitting it), you'll see the output requires extra work on the front end, which is unnecessary.
  4. The name of the output file
  5. The input shapefile

Alright, GeoJSON file done.

Step 4: Download the data

Assuming enough time has passed, Repustate will have found some data for us and we can retrieve it via the Social API:

https://gist.github.com/mostrovsky/7762900

The data comes down as a CSV. We'll leave it as an exercise for the reader to store the data in a format that you can query and retrieve later, but let's just assume we've stored our data. Included in this data dump will be the latitude and longitude of each tweet(er). How do we know if this tweet(er) is in Toronto? There are a few ways, actually.

We could use PostGIS, an unbelievably amazing extension for PostgreSQL that lets you run all sorts of interesting queries over geographic data. If we went this route, we could load the original shapefile into a PostGIS-enabled database, then for each tweet use the ST_Contains function to check which tweet(er)s are in Toronto. Another way is to iterate over the dataset and, for each point, do a quick point-in-polygon calculation in the language of your choice; the polygon data can be retrieved from the GeoJSON file we created. Either way you go, it's not too hard to determine which data originated in Toronto, given its lat/long co-ordinates.
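For the point-in-polygon route, here's a rough sketch in Python using the shapely package together with the toronto.json file from step 3 (the ward property names depend on the shapefile, so treat them as placeholders):

import json
from shapely.geometry import shape, Point

with open('toronto.json') as f:
    wards = json.load(f)

# Pre-build a shapely polygon for every ward feature in the GeoJSON
ward_polygons = [(feat['properties'], shape(feat['geometry']))
                 for feat in wards['features']]

def ward_containing(lng, lat):
    """Return the properties of the ward containing this point, or None."""
    point = Point(lng, lat)  # GeoJSON order is (longitude, latitude)
    for props, polygon in ward_polygons:
        if polygon.contains(point):
            return props
    return None

print(ward_containing(-79.2827815, 43.7884588))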

Step 5: Visualize the data

Let's see how this data looks. To plot the GeoJSON data and each of the data points from our Twitter data source, we use d3.js. For those not familiar with d3.js (Data-Driven Documents), it's a Javascript library that helps you bind data to graphical representations in pure HTML/SVG. It's a great library for visualizing any kind of data, really, and it makes creating unique, customized maps pretty easy. Best of all, since the output is SVG, you're pretty much guaranteed that your visuals will look identical on all browsers (well, IE 8 and below don't support SVG) and mobile devices. Let's dive into d3.js. We need to plot two things:

  1. The map of Toronto, with dividers for each of the wards
  2. The individual tweets, colour coded for sentiment and placed in the correct lat/long location.

The data for the map comes from the GeoJSON file we created earlier. We’re going to tell d3 to load it and for each ward, we’re going to draw a polygon (the “path” node in SVG) by following the path co-ordinates in the GeoJSON file. Here’s the code:

https://gist.github.com/mostrovsky/7776778

Path elements have an attribute, "d", which describes how the path should be drawn. You can read more about that here. So for each ward, we create a path node and set its "d" attribute to the lat/long co-ordinates that, when connected, form the boundaries of the ward. You'll see we also have custom fill colouring based on the 2010 mayoral election – that's just an extra touch we added. You can play around with the thickness of the border lines. We're also adding a mouseover event to show a little tooltip about the ward in question when the user hovers over it.

Time to add the data points. We're going to separate them by date, so we can add a slider later to show how the data changes over time. We're also including the sentiment information that Repustate automatically calculates for each piece of social data. Here's an example of how one complete data point looks:

{
 "lat": 43.788458853157117, 
 "mm": 10,
 "lng": -79.282781549043008,
 "dd": 28,
 "sentiment": "pos"
}

Now to plot these on our map, we again load a JSON file and for each point, we bind a “circle” object, positioned at the correct lat/long co-ordinates. Here’s the code:

https://gist.github.com/mostrovsky/7776799

At this point, if you tried to render the data on screen, you wouldn't see a thing, just a blank white SVG. That's because you need to deal with the concept of map projections. If you think about a map of the world, there's no "true" way to render it on a flat, 2D surface like your laptop screen or iPad. This has given rise to projections: ways of placing points from a curved surface onto a flat plane. There are many ways to do this, and you can see a list of available projections here; for our example, we used the Albers projection. The last thing we have to do is scale and translate our SVG, and we're done. Here's the code for projecting, scaling and translating our map:

https://gist.github.com/mostrovsky/7776853

Those values are not random – they came about from some trial and error. I'm not sure how to derive them properly or in an algorithmic way.

Please take a look at the finished product and of course, if you or your company is looking into doing this sort of thing with your own data set, contact us!

Building a better TripAdvisor

(This is a guest post by Sarah Harmon. Sarah is currently a PhD student at UC Santa Cruz studying artificial intelligence and natural language processing. Her website is NeuroGirl.com)

Click here to see a demo of Sarah's work in action.

Using Repustate's API to get the most out of TripAdvisor customer reviews

I’m a big traveler, so I often check online ratings, such as those on TripAdvisor, to decide which local hotel or restaurant is worth my time.  In this post, I use Repustate’s text categorization API to analyze the sentiment of hotel reviews, and – in so doing – work towards a better online hotel rating system.

Five-star scales aren’t good enough

The five-star rating scale is a standard ranking system for many products and services.  It’s a fast way for consumers to check out the quality of something before paying the price, but it’s not always accurate.  Five-star ratings are inherently subjective, don’t let raters say all they want to say, and force a generic labelling for a complex experience.

Asking people to write text reviews solves a few of these problems.  Instead of relying on the five-star scale, we ask people to highlight the most memorable parts of their experience.  Still, who has the time to read hundreds of reviews to get a true sense of what a hotel is like?  Even the small sample of reviews shown on the first page could be unhelpful, unreliable or cleverly submitted by the hotel itself to make themselves look good. What we need is a way to summarize the review content – and ideally, we’d like a summary that’s specific to our own values.

Making a personalized hotel ranking system

Let’s take a look at a website that uses star ratings all the time: TripAdvisor, a travel resource that features an incredible wealth of user-reviewed hotels.

trip advisor front page

TripAdvisor uses AJAX-based pagination, which means that if we'd like to access that wealth of data, normal methods of screen-scraping won't work. I decided to use CasperJS, a scripting tool built on top of PhantomJS, to easily scrape the TripAdvisor website with JavaScript.

Here’s a simplified example of how you can use CasperJS to view the star ratings for every listed hotel from New York:

In a similar fashion, I used Python and CasperJS to retrieve a sample of 100 hotels from each of five major locations – Italy, Spain, Thailand, the US, and the UK – and retrieved the top 400 English reviews for each hotel as listed on TripAdvisor. To ensure that each stored review was in English, I relied on Python's Natural Language Toolkit. Finally, I called the Repustate API to analyze each review using six hotel-specific categories: food, price, location, accommodations, amenities, and staff.
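The post doesn't spell out the language check, but one common NLTK-based approach is to see which language's stopword list overlaps the review the most and keep only the reviews that score highest for English. A rough sketch (it assumes you've run nltk.download('stopwords')):

from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

def most_likely_language(text):
    """Pick the language whose stopword list overlaps the text the most."""
    tokens = set(word.lower() for word in wordpunct_tokenize(text))
    scores = {lang: len(tokens & set(stopwords.words(lang)))
              for lang in stopwords.fileids()}
    return max(scores, key=scores.get)

review = "The staff were friendly and the room had a lovely view of the bay."
if most_likely_language(review) == 'english':
    print("keep this review")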

sample of hotel rater

Check out the results in this “HotelRater” demo.  Select a location and a niche you care about, and you’ll see hotels organized in order of highest sentiment for that category.  To generate those results, the sentiment scores for each hotel across each category were averaged, and then placed on a scale from 1 to 100.  (I chose to take a mean of the sentiment values because it’s a value that’s easy to calculate and understand.)  The TripAdvisor five-star rating is shown for comparison.  You can also click on the hotels listed to see how Repustate categorized each of their reviews.

When I started putting the app I built into practice, I could suddenly make sense of TripAdvisor's abundance of data. While a hotel might have only a four-star rating on average, its customers might be generally very happy with key aspects such as its food, staff, and location. Desirable hotels popped up in my search results that I might never have seen or considered because of their lower average star rating on TripAdvisor. The listed sentiment scores also helped to differentiate hotels that I would previously have had trouble sorting through because they all shared the same 4.5-star rating.

This demo isn’t a replacement for TripAdvisor by any means, since there’s hardly enough stored data or options included to assist you in your quest for the perfect hotel on your vacation.  That said, it’s a positive step towards a new ranking system that’s aligned with our individual values.  We can quickly see how people are really feeling, without condensing their more specific thoughts into a blanket statement.

I’d give that five stars any day.

Sentiment Analysis Accuracy – Goal or Distraction?

Everything you need to know about sentiment analysis accuracy

When potential customers approach us, one of the first questions we're asked is "How accurate is your sentiment analysis engine?" Well, as any good MBA graduate will tell you, the correct answer is "It depends." That might sound like a cop-out and a way to avoid answering the question, but in fact it's the most accurate response one can give.

Let’s see why. Take this sentence:

“It is sunny today.”

Positive or negative? Most would perhaps say positive. Repustate would score this as neutral, since no opinion or intent is being stated, merely a fact. For those who would argue that this sentence is positive, let's tweak it:

“It is sunny today; my crops are drying out and that will cause me to go bankrupt.”

Well, that changes things, doesn't it? Put into a greater context, the polarity (positive or negative sentiment) of a phrase can change dramatically. That's why it is difficult to achieve 100% accuracy with sentiment analysis.

Let’s take a look at another example. From a review of a horror movie:

“It will make you quake in fear.”

Positive or negative? Well, for a horror movie, this is a positive because horror movies are supposed to be scary. But if we were talking about watching a graphic video about torture, then the sentiment is negative. Again, context matters here more than the actual words themselves.

What about when we introduce conjunctions into the mix:

I thought the movie was great, but the popcorn was stale and too salty.

The first part of the sentence is positive, but the second part is negative. So what is the sentiment of this sentence? At Repustate, we would say “It depends.” The sentiment for the movie is positive; the sentiment surrounding the popcorn is negative. The question is: which topic are you interested in analyzing?

There is no one TRUE sentiment engine

As the previous examples have demonstrated, sentiment analysis is tricky and highly contextual. As a result, any benchmarks that companies post have to be taken with a pinch of salt. Repustate’s approach is the following:

  1. Make our global model as flexible as possible to catch as many cases as possible without being too inconsistent
  2. Allow customers the ability to load their own custom, domain-specific, definitions via our API (e.g. Make the phrase “quake in fear” positive for movie reviews)
  3. Allow sentiment to be scored in the context of only one topic, again, via our API

When shopping for sentiment analysis solutions, ask potential vendors about these points. Make sure whichever solution you end up going with can be tailored to your specific domain, because chances are your data has unique attributes and characteristics that might be overlooked or incorrectly accounted for by a one-size-fits-all sentiment engine.

Or you can save yourself a lot of time and use Repustate’s flexible sentiment analysis engine.

Segmenting Twitter hashtags

Segmenting Twitter hashtags to gain insight

Twitter hashtags are everywhere these days, and the ability to data mine them is an important one. The problem with hashtags is that they are one long string composed of a few smaller ones. If we can segment the long hashtag into its individual words, we gain extra insight into the context of the tweet and can perhaps determine its sentiment as a result.

So what to do? How do we solve the problem of the long, single string?

Use the probabilities, Luke

As we did with Chinese sentiment, we rely on conditional probabilities to determine the most likely words in a string of characters. Put differently, you're trying to answer the question: "If the previous word was X, what are the odds the next word is Y?" To answer this, you need to build a probability model from some tagged data. We grabbed the most common bigrams from Google's ngram database and, using the frequencies listed, constructed a probability model.

To better understand why we need the probabilities, let's take a look at a concrete example. Take the following hashtag: #vacationfallout. There are two plausible segmentations here: ["vacation", "fallout"] or ["vacation", "fall", "out"]. So how do we know which to use? We examine the probability that the string "fallout" comes after "vacation". This probability, as we know from our model, is higher than the probability of the words "fall" and "out" coming after "vacation", so that's the segmentation we go with.
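To make that concrete, here's a toy sketch of the scoring step. The probabilities below are invented for illustration; the real numbers come from the Google ngram frequencies:

# Toy bigram model: P(next word | previous word). Numbers are made up.
BIGRAM_PROB = {
    ('vacation', 'fallout'): 1e-5,
    ('vacation', 'fall'): 2e-6,
    ('fall', 'out'): 3e-4,
}
UNSEEN = 1e-9  # crude smoothing for pairs the model has never seen

def score(words):
    """Multiply the conditional probability of each adjacent word pair."""
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= BIGRAM_PROB.get((prev, nxt), UNSEEN)
    return prob

candidates = [['vacation', 'fallout'], ['vacation', 'fall', 'out']]
print(max(candidates, key=score))  # ['vacation', 'fallout'] with these numbers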

Now of course, since we're dealing with probabilities, we might be wrong. Perhaps the author did intend for that hashtag to mean ["vacation", "fall", "out"]. But we learn to live with the fact that we'll be wrong sometimes; the key is that we'll be right far more often than we're wrong.

Memoizing to increase performance

Since the Repustate API is hit pretty heavily, we still need to be concerned with performance. The first step is to determine whether there is a hashtag to expand; we do this with a simple regular expression. The next step, once we've determined a hashtag is present, is to expand it into its individual words. To make things go a bit faster, we memoize the functions we use so that when we encounter the same patterns (and we do), we don't waste time recalculating things from scratch each time. Here's the sort of decorator we use to memoize in Python:
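A minimal sketch of such a memoizing decorator, with the cache keyed on the function's positional arguments (the segment function is just a placeholder):

import functools

def memoize(func):
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        # Only compute the result the first time we see these arguments
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper

@memoize
def segment(hashtag):
    ...  # the expensive probability-based segmentation goes here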

Using Python's AST parser to translate text queries to regular expressions

Python's AST (Abstract Syntax Tree) module is pretty darn useful

We recently introduced a new set of API calls for our enterprise customers (all customers will soon have access to these APIs) that allow you to create customized rules for categorizing text. For example, let's say you want to classify Twitter data into one of two categories: Photography or Photoshop. If a tweet has to do with photography, that's one category; if it has to do with Photoshop, that's the other.

So we begin by listing some boolean rules describing how we want to match our text. We can use the OR operator, we can use AND, we can use negation by placing a "-" (dash or hyphen) before a token, and we can use parentheses to group pieces of logic together. Here are our definitions for the two categories:

Photography: photo OR camera OR picture
Photoshop: “Photoshop” -photo -shop

The first rule states that if a tweet has at least one of the words "photo", "camera" or "picture", then classify it as being in the Photography category. The second rule states that if it has the word "Photoshop" and does not contain the words "photo" and "shop" by themselves, then the text falls under the Photoshop category. You'll notice there's an implicit AND operator wherever we use whitespace to separate tokens.

Now one approach would be to take your text, tokenize it into a bag of words, and then go one by one through each of the boolean clauses and see which matches. But that’s way too slow. We can do this much faster using regular expressions.

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

The hilarity of the quote above notwithstanding, this problem is ready-made for regular expressions because we can compile our rules once at startup and then just iterate over them for each piece of text. But how do you convert the category definitions above into a regular expression, using negative lookaheads/lookbehinds and all that other regexp goodness? We used Python's AST module.

AST to the rescue

Thinking back to your days in CS, you’ll remember that an expression like 2 + 2 can be parsed and converted into a tree, where the binary operator ‘+’ is the parent and it has two child nodes, namely ‘2’ and ‘2’. If we take our boolean expressions and replace OR with ‘+’ and the AND operator (whitespace) with a ‘*’, we can feed our text into the Python ast module like so:
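Here's a sketch of that step. The token-by-token rewrite below is one straightforward way to do the substitution (the actual preprocessing may differ in its details):

import ast

def to_python_expr(rule):
    """Rewrite a boolean rule as Python arithmetic: OR becomes +, the
    implicit AND (whitespace) becomes *, and '-token' becomes unary minus."""
    out = []
    for token in rule.split():
        if token == 'OR':
            out.append('+')
        else:
            if out and out[-1] != '+':
                out.append('*')  # implicit AND between adjacent operands
            out.append(token)
    return ' '.join(out)

print(to_python_expr('photo OR camera OR picture'))   # photo + camera + picture
print(to_python_expr('"Photoshop" -photo -shop'))     # "Photoshop" * -photo * -shop

tree = ast.parse(to_python_expr('"Photoshop" -photo -shop'), mode='eval')
print(ast.dump(tree))  # BinOp/UnaryOp nodes, ready to be walked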

A "process" method then traverses the tree and emits the necessary regular expression text:
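Here's a condensed sketch of such a traversal, following the mapping above: '+' becomes alternation, '*' becomes a lookahead-based AND, and unary minus becomes a "must not appear" pattern. It's a reconstruction of the idea, not the production code:

import ast

def process(node):
    if isinstance(node, ast.Expression):
        return process(node.body)
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        # OR: plain alternation
        return '%s|%s' % (process(node.left), process(node.right))
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mult):
        # AND: require the left side via a lookahead, then match the right
        return '(?=.*%s).*%s' % (process(node.left), process(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        # Negation: the token must not appear anywhere in the text
        return '^((?!%s).)*$' % process(node.operand)
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):  # a quoted token like "Photoshop"
        return r'\b%s\b' % node.value
    raise ValueError('unsupported node: %r' % node)

expr = to_python_expr('"Photoshop" -photo -shop')  # helper from the previous snippet
print(process(ast.parse(expr, mode='eval')))
# (?=.*(?=.*\bPhotoshop\b).*^((?!photo).)*$).*^((?!shop).)*$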

(I've omitted some of the helper methods, but what you see here is the heart of the algorithm.) So the final regular expressions for the two rules above look like this:

Photography: ‘photo|camera|picture’
Photoshop: ‘(?=.*(?=.*\bPhotoshop\b).*^((?!photo).)*$).*^((?!shop).)*$’

That second rule in particular is a doozy because it uses lookarounds, which are a pain in the butt to derive manually.

The AST module gives us a tree where each node has a type. So when traversing the tree, we just have to check which type of node we're dealing with and proceed accordingly. If it's a binary operator, for example the OR operation, we know we have to put a pipe (i.e. "|") between the two operands to form an "OR" in regular expression syntax. Similarly, AND and NOT are processed accordingly, and since it's a tree, we can do this recursively. Neat.

(More documentation on the AST module can be found here.)

The final product is a set of fast regular expressions that lets us categorize text quickly and accurately. In the next post, I'm going to talk about semantic categorization (e.g. tag all pieces of text that have to do with football or baseball under the "Sports" category), so stay tuned!

Provisioning virtual appliances with Vagrant

Virtual Appliances

Repustate's API is great for 95% of our customers, but for the other 5%, we needed something else. That 5% are enterprise customers who cannot transmit their data (for both legal and security reasons) across the public internet, even with HTTPS. So we created the Repustate Server product, which is our API housed within a virtual appliance, distributed for either VMware or VirtualBox. Since it's a virtual appliance, we're really just shipping a self-contained OS with all the bits & pieces installed, configured and ready to use. No installation, no compilation: just download the Server instance, fire it up, and you're off. Sounds great, right? So what's the problem? Well, provisioning demos, assigning static IPs, and provisioning multiple instances can get tricky and tedious.

Demos

Many customers request demos of the Repustate Server. Our demos need to be configured with data that disables the virtual appliance after a certain date. (Since one of our selling points is that the Repustate Server requires no external internet access, all configuration must be baked in; no "phoning home" allowed.)

Furthermore, details about the customer need to be embedded into the virtual appliance. At first, we would clone a “base” virtual machine, edit the various files, package it up, upload it somewhere and then send a link. Yuck. It took way too long and we had to keep making sure the virtual appliance was up-to-date with all of the recent changes to the Repustate code base (multiple code repositories, precompiled binaries etc.)

Static IPs

Many customers could not just fire up a virtual appliance and use DHCP to assign it an IP address, so there was a need for static IPs. A customer would specify the various network settings they needed, we would configure the virtual appliance, and then we'd repeat the previous steps of packaging, uploading, and sending out a link to download. Yuck. Even more manual, tedious fiddling around.

(Note: I know there's a tool called the VMware Network Editor, but apparently it is not included with VMware Player, only VMware Workstation, and some customers for whatever reason don't have access to it, hence the need for us to configure static IPs for them within the guest OS.)

Multiple Instances

Our larger customers run multiple Repustate Servers in parallel to form a cluster of machines for the purpose of increasing throughput. If you're trying to consume and analyze a steady stream of text, you're going to need some decent processing power to keep up, especially if you want things to be synchronous. To that end, our customers put a load balancer in front of several Repustate Server instances, and now they have their own text analytics cluster chugging away with little to no maintenance. Ah, but what happens when our customers want a new instance to throw into the cluster? They can't just clone an existing one because of the static IP issue described above. So they have to ask us for a new instance. Yuck.

Vagrant to the rescue

Alright, this was a long setup for the payoff: Vagrant. Vagrant allows you to programmatically (i.e. from the shell, no GUI required) create new instances of a virtual machine, for either VirtualBox or VMware. You can completely configure the new instances using Chef or Puppet to download whatever code you need, compile any packages you want, and so on. It completely automates the role of server admin, or at least the server setup part. With Vagrant, all the customizations Repustate needs to perform for each customer are automated with Chef. For example, say we need to know when a demo expires: we have a Chef recipe that queries our database for the license expiration date, retrieves it, and stores it in the new virtual appliance.

Customers can now create their Repustate Server instances on their own using a simple form in the Repustate dashboard:

provisioning

This includes setting their own IP configuration for those who can't use DHCP. Once the customer requests their new instance, Vagrant fires up (asynchronously, just like asking Amazon EC2 for a new instance) and begins provisioning it. Once finished, an email is sent with a link to download the new virtual appliance. Eventually, we'll add an API call for this so it can all be done programmatically without needing to log in to the dashboard.

A massive time saver for us at Repustate – big thanks to Mitchell for putting Vagrant together.