Building a better TripAdvisor

(This is a guest post by Sarah Harmon. Sarah is currently a PhD student at UC Santa Cruz studying artificial intelligence and natural language processing. Her website is NeuroGirl.com)

Click here to see a demo of Sarah’s work in action.

Using Repustate’s API to get the most out of TripAdvisor customer reviews

I’m a big traveler, so I often check online ratings, such as those on TripAdvisor, to decide which local hotel or restaurant is worth my time.  In this post, I use Repustate’s text categorization API to analyze the sentiment of hotel reviews, and – in so doing – work towards a better online hotel rating system.

Five-star scales aren’t good enough

The five-star rating scale is a standard ranking system for many products and services.  It’s a fast way for consumers to check out the quality of something before paying the price, but it’s not always accurate.  Five-star ratings are inherently subjective, don’t let raters say all they want to say, and force a generic labelling for a complex experience.

Asking people to write text reviews solves a few of these problems.  Instead of relying on the five-star scale, we ask people to highlight the most memorable parts of their experience.  Still, who has the time to read hundreds of reviews to get a true sense of what a hotel is like?  Even the small sample of reviews shown on the first page could be unhelpful, unreliable, or cleverly submitted by the hotel to make itself look good. What we need is a way to summarize the review content – and ideally, we’d like a summary that’s specific to our own values.

Making a personalized hotel ranking system

Let’s take a look at a website that uses star ratings all the time: TripAdvisor, a travel resource that features an incredible wealth of user-reviewed hotels.

[Image: TripAdvisor front page]

TripAdvisor uses AJAX-based pagination, which means that if we’d like to access that wealth of data, normal methods of screen-scraping won’t work.  I decided to use CasperJS, a navigation scripting utility built on top of PhantomJS, to easily scrape the TripAdvisor website with JavaScript.

Here’s a simplified example of how you can use CasperJS to view the star ratings for every listed hotel from New York:

In a similar fashion, I used Python and CasperJS to retrieve a sample of 100 hotels each from five major locations – Italy, Spain, Thailand, the US, and the UK – and retrieved the top 400 English reviews from each hotel as listed on TripAdvisor.  To ensure that each stored review was in English, I relied on Python’s Natural Language Toolkit. Finally, I called on the Repustate API to analyze each review using six hotel-specific categories: food, price, location, accommodations, amenities, and staff.
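The post does not say which part of the Natural Language Toolkit handled the English check. One common heuristic is to count stopwords per language and pick the language with the most hits. The sketch below illustrates that idea using only the standard library: the tiny hard-coded stopword sets are an assumption standing in for a fuller stopword corpus, and the function names `guess_language` and `is_english` are hypothetical.

```python
import re

# Small hand-picked stopword samples (assumption: stand-ins for a fuller
# stopword corpus such as the one a language toolkit would provide).
STOPWORDS = {
    "english": {"the", "and", "was", "is", "to", "of", "in", "it", "we", "very"},
    "spanish": {"el", "la", "y", "era", "es", "de", "en", "muy", "los", "que"},
    "italian": {"il", "la", "e", "era", "di", "in", "molto", "che", "per", "non"},
}


def guess_language(text):
    """Return the language whose stopwords appear most often, or None."""
    # Lowercase the text and pull out runs of (possibly accented) letters.
    words = re.findall(r"[a-zà-ÿ']+", text.lower())
    counts = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    # If no stopword matched at all, refuse to guess.
    return best if counts[best] > 0 else None


def is_english(text):
    """True when English stopwords outnumber those of the other languages."""
    return guess_language(text) == "english"


reviews = [
    "The room was clean and the staff was very friendly.",
    "La habitación era muy limpia y el personal muy amable.",
]
english_reviews = [review for review in reviews if is_english(review)]
print(english_reviews)
```

A heuristic like this can misfire on very short reviews, so a real pipeline would likely want a minimum length or a confidence margin before discarding anything.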

[Image: sample of the HotelRater demo]

Check out the results in this “HotelRater” demo.  Select a location and a niche you care about, and you’ll see hotels organized in order of highest sentiment for that category.  To generate those results, the sentiment scores for each hotel across each category were averaged, and then placed on a scale from 1 to 100.  (I chose to take a mean of the sentiment values because it’s a value that’s easy to calculate and understand.)  The TripAdvisor five-star rating is shown for comparison.  You can also click on the hotels listed to see how Repustate categorized each of their reviews.
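To make the averaging step concrete, here is a minimal sketch of how per-review sentiment scores could be rolled up into a 1 to 100 rating per hotel and category. It is not the code behind the demo. It assumes each review has already been scored per category with a value between -1 (most negative) and 1 (most positive), and it uses a simple linear rescaling, which is one plausible choice. The names `to_scale`, `rate_hotels`, and `rank_by`, and the sample data, are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def to_scale(score, low=-1.0, high=1.0):
    """Linearly map a score in [low, high] onto the 1-100 scale.

    Assumption: raw sentiment scores fall between -1 and 1.
    """
    clamped = max(low, min(high, score))
    return 1 + 99 * (clamped - low) / (high - low)


def rate_hotels(scored_reviews):
    """Average sentiment per (hotel, category), then rescale to 1-100.

    scored_reviews: iterable of (hotel, category, score) tuples.
    """
    buckets = defaultdict(list)
    for hotel, category, score in scored_reviews:
        buckets[(hotel, category)].append(score)

    ratings = defaultdict(dict)
    for (hotel, category), scores in buckets.items():
        ratings[hotel][category] = round(to_scale(mean(scores)))
    return dict(ratings)


def rank_by(ratings, category):
    """Hotels sorted from highest to lowest rating in one category."""
    rated = [
        (hotel, categories[category])
        for hotel, categories in ratings.items()
        if category in categories
    ]
    return sorted(rated, key=lambda pair: pair[1], reverse=True)


# Made-up scores purely for illustration.
sample = [
    ("Hotel A", "food", 0.8), ("Hotel A", "food", 0.4),
    ("Hotel A", "staff", -0.2),
    ("Hotel B", "food", 0.1), ("Hotel B", "staff", 0.9),
]
ratings = rate_hotels(sample)
print(rank_by(ratings, "food"))   # [('Hotel A', 80), ('Hotel B', 55)]
print(rank_by(ratings, "staff"))  # [('Hotel B', 95), ('Hotel A', 41)]
```

Ranking by a different category reorders the same hotels, which is the point of the demo: Hotel A wins on food while Hotel B wins on staff.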

When I started putting the app I built into practice, I could suddenly make sense of TripAdvisor’s abundance of data.  A hotel might have only a four-star rating on average, yet its customers were generally very happy with key aspects such as its food, staff, and location.  Desirable hotels popped up in my search results that I might never have even seen or considered because of their lower average star rating on TripAdvisor.  The listed sentiment scores also helped to differentiate hotels that I would previously have had trouble sorting through because they all shared the same 4.5-star rating.

This demo isn’t a replacement for TripAdvisor by any means, since there’s hardly enough stored data or options included to assist you in your quest for the perfect hotel on your vacation.  That said, it’s a positive step towards a new ranking system that’s aligned with our individual values.  We can quickly see how people are really feeling, without condensing their more specific thoughts into a blanket statement.

I’d give that five stars any day.

Sentiment Analysis Accuracy – Goal or Distraction?

Everything you need to know about sentiment analysis accuracy

When potential customers approach us, one of the first questions we’re asked is “How accurate is your sentiment analysis engine?” Well, as any good MBA graduate would tell you, the correct answer is “It depends.” That might sound like a cop-out and a way to avoid answering the question, but in fact, it’s the most accurate response one can give.

Let’s see why. Take this sentence:

“It is sunny today.”

Positive or negative? Most would perhaps say positive – Repustate would score this as being neutral as no opinion or intent is being stated, merely a fact. For those who would argue that this sentence is positive, let’s tweak this sentence:

“It is sunny today, my crops are getting dried out and will cause me to go bankrupt.”

Well, that changes things, doesn’t it? Put into a greater context, the polarity (positive or negative sentiment) of a phrase can change dramatically. That’s why it is difficult to achieve 100% accuracy with sentiment.

Let’s take a look at another example. From a review of a horror movie:

“It will make you quake in fear.”

Positive or negative? Well, for a horror movie, this is a positive because horror movies are supposed to be scary. But if we were talking about watching a graphic video about torture, then the sentiment is negative. Again, context matters here more than the actual words themselves.

What about when we introduce conjunctions into the mix:

“I thought the movie was great, but the popcorn was stale and too salty.”

The first part of the sentence is positive, but the second part is negative. So what is the sentiment of this sentence? At Repustate, we would say “It depends.” The sentiment for the movie is positive; the sentiment surrounding the popcorn is negative. The question is: which topic are you interested in analyzing?

There is no one TRUE sentiment engine

As the previous examples have demonstrated, sentiment analysis is tricky and highly contextual. As a result, any benchmarks that companies post have to be taken with a pinch of salt. Repustate’s approach is the following:

  1. Make our global model as flexible as possible to catch as many cases as possible without being too inconsistent
  2. Allow customers to load their own custom, domain-specific definitions via our API (e.g. make the phrase “quake in fear” positive for movie reviews)
  3. Allow sentiment to be scored in the context of only one topic, again, via our API
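
As a toy illustration of points 2 and 3, the sketch below scores text with a tiny word list, lets a domain-specific override flip the polarity of “quake in fear”, and scores only the clauses that mention a chosen topic. It does not show how Repustate’s engine or API works: the lexicon, its weights, the clause-splitting rule, and the function names `score_text` and `score_topic` are all assumptions made up for this example.

```python
import re

# Tiny illustrative lexicon (assumption); a real engine uses far richer models.
BASE_LEXICON = {
    "great": 1.0,
    "stale": -1.0,
    "salty": -0.5,
    "quake in fear": -1.0,
}


def score_text(text, lexicon):
    """Sum the weights of every lexicon phrase found in the text."""
    lowered = text.lower()
    total = 0.0
    for phrase, weight in lexicon.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        total += weight * len(re.findall(pattern, lowered))
    return total


def score_topic(text, topic, lexicon):
    """Score only the clauses that mention the topic.

    Clauses are split on "but" and semicolons, a deliberately crude rule.
    """
    clauses = re.split(r",?\s+\bbut\b\s+|;", text)
    relevant = [clause for clause in clauses if topic.lower() in clause.lower()]
    return sum(score_text(clause, lexicon) for clause in relevant)


# Point 2: a domain-specific override makes "quake in fear" positive.
movie_lexicon = {**BASE_LEXICON, "quake in fear": 1.0}
horror_line = "It will make you quake in fear."
print(score_text(horror_line, BASE_LEXICON))   # -1.0
print(score_text(horror_line, movie_lexicon))  # 1.0

# Point 3: the same sentence scored in the context of a single topic.
review = "I thought the movie was great, but the popcorn was stale and too salty."
print(score_topic(review, "movie", movie_lexicon))    # 1.0
print(score_topic(review, "popcorn", movie_lexicon))  # -1.5
```

Even in this toy form, the same sentence produces opposite scores depending on the topic being asked about, which is why a single accuracy number says little without knowing the question.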

When shopping for sentiment analysis solutions, ask potential vendors about these points. Make sure whichever solution you end up going with can be tailored to your specific domain, because chances are your data has unique attributes and characteristics that might be overlooked or incorrectly accounted for by a one-size-fits-all sentiment engine.

Or you can save yourself a lot of time and use Repustate’s flexible sentiment analysis engine.