What people talk about on Facebook

A lot of interesting things … and some not so interesting things.

I’ve been reading a lot of Facebook status updates over the past week or two, trying to get a good grip on the kinds of things people talk about on Facebook and how Repustate can leverage this avalanche of data. While there are a lot of useful nuggets of information, there’s also a lot of nonsense. Here’s a quick summary for those who don’t want to read thousands of other people’s status messages:

    • A lot of people (presumably younger ones) hate either their lives, or their moms, or both. They also can’t see how their lives could be any worse than they already are.
    • A lot of people pray to God/Jesus/Jebus for all sorts of things including: doing well on a test, good weather, and for a boy/girl to reciprocate their feelings.
    • Playing off the same religious theme (and people on Facebook sure are preachy!), many people took part in a viral meme where they “Wish Heaven had a phone” so they could send a message to a loved one who has passed. In fact, of 200K Facebook messages analyzed, more than 9K wished Heaven had a phone. Perhaps AT&T is missing out on a huge new customer base.
    • Many male users are fed up with their female counterparts abusing their financial generosity. Of course, they don’t state it in such terms, but that’s the gist of things.
    • Many female users are fed up with being emotionally abused and “played” by their male counterparts.
    • And an alarmingly large number of people have tried to sell their younger siblings on Facebook (e.g. Does anybody want to buy my younger brother? He’s so annoying!!!!1111)

There you have it. I have scoured the depths of social networking and filtered out the noise so you don’t have to.

Data (and pattern recognition) > Algorithms

No matter how clever your algorithms and heuristics, having a ton of data trumps all.

We’ve been mixing ingredients for a while at Repustate, trying to come up with the perfect recipe for our problem. Our goal is to determine what somebody is intending to buy based on what they write on various social media outlets. For example, “I want to buy a bike” is a statement of purchase intent, while “There goes the last bus” is not. We went about trying to solve this problem by taking a small sample of data, manually tagging it as being purchase intent or not, and then hoping to extrapolate. This is generally the approach taken in the world of NLP. We derived all sorts of clever rules-based logic to try to accurately predict this intent. Turns out, it’s next to impossible to do this accurately based solely on “state-of-the-art” algorithms and systems.
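To make the rules-based idea concrete, here is a minimal sketch of what a hand-written rule set might look like. The patterns and the function name `has_purchase_intent` are made up for illustration and are not Repustate’s actual rules; the point is only to show how quickly such rules break on informal text.

```python
import re

# Hypothetical trigger phrases for illustration; not Repustate's actual rules.
INTENT_PATTERNS = [
    re.compile(
        r"\b(?:want|need|plan|planning|looking|going)\s+to\s+"
        r"(?:buy|get|purchase|order)\b",
        re.IGNORECASE,
    ),
    re.compile(r"\b(?:shopping|saving up)\s+for\b", re.IGNORECASE),
]


def has_purchase_intent(text: str) -> bool:
    """Return True if any hand-written pattern matches the text."""
    return any(pattern.search(text) for pattern in INTENT_PATTERNS)


print(has_purchase_intent("I want to buy a bike"))          # True
print(has_purchase_intent("There goes the last bus"))       # False
# Informal spelling slips past the rule even though the intent is obvious.
print(has_purchase_intent("i wanna buy a bike soooo bad"))  # False
```

The third call is the interesting one: a person reads it as purchase intent right away, but the rule misses it, and every patch for a case like this tends to expose another.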

You see, it’s not that the algorithms and methodologies that academics in the NLP world publish are useless; it’s just that they’re not useful when applied to more general sets of documents. Let me explain.

Academics generally take a corpus of text from a very specific subset of language (movie reviews are always a popular choice) and train their machine learning systems on that. Their results are often great (some > 90% accuracy) but I view these as contrived. Movie reviews or articles from Reuters’ archive, another popular source, are well-written pieces of English text. The sentences are structured properly, there is proper subject-verb agreement, etc. In other words, it’s almost the complete opposite of what you read on Twitter, Facebook or a blog comment.

The reality is that if you want to accurately tag social media content with semantic meaning, you can’t rely on the traditional means of learning. You have to do what Google does and amass a ridiculous amount of data, find patterns, and then try to predict. Google Instant is an example of this. Their predictions of what you’re trying to search for are based on the billions of n-grams they’ve generated over the past decade. When you type “How can I get my girlfriend to”, with a high degree of probability, Google knows what you’re going to type next. I won’t reproduce here the possibilities Google shows as that’s more than slightly NSFW.
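For a rough feel of how n-gram counts turn into predictions, here is a toy sketch. It is not how Google Instant works internally (I have no knowledge of that), and the corpus, `build_ngram_model` and `predict_next` are all invented for illustration. It simply counts which word followed each two-word context and suggests the most frequent ones.

```python
from collections import Counter, defaultdict


def build_ngram_model(sentences, n=3):
    """Map each (n-1)-word context to a Counter of the words that followed it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            model[context][words[i + n - 1]] += 1
    return model


def predict_next(model, text, n=3, k=3):
    """Return up to k most frequent continuations of the last n-1 words."""
    context = tuple(text.lower().split()[-(n - 1):])
    return [word for word, _ in model.get(context, Counter()).most_common(k)]


# A made-up, safe-for-work stand-in for billions of real queries.
corpus = [
    "how can i get my phone to charge",
    "how can i get my phone to charge faster",
    "how can i get my phone to sync",
    "how can i get my dog to sit",
]
model = build_ngram_model(corpus)

print(predict_next(model, "How can I get my phone to"))  # ['charge', 'sync']
print(predict_next(model, "How can I get my cat to"))    # []
```

The second call returns nothing because the model never saw “cat to”. With four sentences that happens constantly; with billions of n-grams it happens far less often, which is the whole argument for more data.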

And that’s what Repustate is doing now. Grabbing loads of data and categorizing it manually. It’s the only way.