What people talk about on Facebook

A lot of interesting things … and some not so interesting things.

I’ve been reading a lot of Facebook status updates over the past week or two, trying to get a good grip on the kinds of things people talk about on Facebook and how Repustate can leverage this avalanche of data. While there are a lot of useful nuggets of information, there’s also a lot of nonsense. Here’s a quick summary for those who don’t want to read thousands of other people’s status messages:

    • A lot of people (presumably younger ones) hate either their lives, or their moms, or both. They also can’t see how their lives could be any worse than they already are.
    • A lot of people pray to God/Jesus/Jebus for all sorts of things including: doing well on a test, good weather, and for a boy/girl to reciprocate their feelings.
    • Playing off the same religious theme (and people on Facebook sure are preachy!), many people took part in some viral meme where they “Wish Heaven had a phone” so they could send a message to a loved one who’s passed. In fact, of 200K Facebook messages analyzed, more than 9K wished Heaven had a phone. Perhaps AT&T is missing out on a huge new customer base.
    • Many male users are fed up with their female counterparts abusing their financial generosity. Of course, they don’t state it in such terms, but that’s the gist of things.
    • Many female users are fed up with being emotionally abused and “played” by their male counterparts.
    • And an alarmingly large number of people have tried to sell their younger siblings on Facebook (e.g. Does anybody want to buy my younger brother? He’s so annoying!!!!1111)

There you have it. I have scoured the depths of social networking and filtered out the noise so you don’t have to.

Data (and pattern recognition) > Algorithms

No matter how clever your algorithms and heuristics, having a ton of data trumps all.

We’ve been mixing ingredients for a while at Repustate, trying to come up with the perfect recipe for our problem. Our goal is to determine what somebody is intending to buy based on what they write on various social media outlets. For example, “I want to buy a bike” is a statement of purchase intent, while “There goes the last bus” is not. We went about trying to solve this problem by taking a small sample of data, manually tagging it as being purchase intent or not, and then hoping to extrapolate. This is generally the approach taken in the world of NLP. We derived all sorts of clever rules-based logic to try to accurately predict this intent. Turns out, it’s next to impossible to do this accurately based solely on “state-of-the-art” algorithms and systems.

You see, it’s not that the algorithms and methodologies that academics in the NLP world publish are useless; it’s just that they’re not useful when applied to more general sets of documents. Let me explain.

Academics generally take a corpus of text from a very specific subset of language (movie reviews are always a popular choice) and train their machine learning systems on that. Their results are often great (some > 90% accuracy) but I view these as contrived. Movie reviews or articles from Reuters’ archive, another popular source, are well-written pieces of English text. The sentences are structured properly, there is proper subject-verb agreement, etc. In other words, it’s almost the complete opposite of what you read on Twitter, Facebook or a blog comment.

The reality is that if you want to accurately tag social media content with semantic meaning, you can’t rely on the traditional means of learning. You have to do what Google does and amass a ridiculous amount of data, find patterns, and then try to predict. Google Instant is an example of this. Their predictions of what you’re trying to search for are based on the billions of n-grams they’ve generated over the past decade. When you type “How can I get my girlfriend to”, with a high degree of probability, Google knows what you’re going to type next. I won’t reproduce here the possibilities Google shows as that’s more than slightly NSFW.

And that’s what Repustate is doing now. Grabbing loads of data and categorizing it manually. It’s the only way.
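To make the “amass data, find patterns, then predict” idea concrete, here is a minimal sketch of next-word prediction from bigram counts. It is a toy illustration only: the function names and the three sample queries are made up for this example, and it is not a description of how Google Instant or Repustate actually work.

```python
from collections import Counter, defaultdict


def build_bigram_model(queries):
    """Count which word follows each word across a list of past queries."""
    following = defaultdict(Counter)
    for query in queries:
        tokens = query.lower().split()
        for current, nxt in zip(tokens, tokens[1:]):
            following[current][nxt] += 1
    return following


def predict_next(model, text, k=3):
    """Return up to k of the most frequent continuations of the last word typed."""
    tokens = text.lower().split()
    if not tokens:
        return []
    return [word for word, _ in model.get(tokens[-1], Counter()).most_common(k)]


# Hypothetical "search log"; a real system would have billions of entries.
sample_queries = [
    "i want to buy a bike",
    "i want to buy a car",
    "i want to sell my bike",
]
model = build_bigram_model(sample_queries)
print(predict_next(model, "I want to"))  # ['buy', 'sell']
```

With three queries the predictions are trivial. The point of the post is that the same counting only becomes useful once the pile of data is enormous.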

You know what, Twitter can actually be useful

Despite the noise and nonsensical hash tags, Twitter actually does contain some useful information. You just need the tools to find it.

When Repustate first opened its doors, we saw a lot of content coming from established media sources. The overwhelming majority of content being pushed through our API by our developer partners came from sites like the New York Times and TechCrunch. But for the social marketer out there, the true value lies in analyzing user-generated media, namely Facebook and Twitter.

Some brands use their Twitter feeds as complements to their RSS feed (wrong!), while some celebrities use it to connect with their fan base and establish a more personal connection (right!). But from my personal experience with Twitter, both as a user and as a lurker, its actual use and its potential utility were two completely different things.

By and large, Twitter is a ghetto of content. Annoying chain-letter-like tweets that play on some hash tag subculture (e.g. #thingsyouhateinthemorning) dominate the public stream. Throw in some slang and poor spelling, and you have a breeding ground for incomprehensible dialogue.

But lo and behold, if you’re searching for very specific items, even amongst the sea of crap, you will find some gold. And that’s what we’ve uncovered the past few weeks. Twitter users occasionally produce nuggets of information that a social media marketer would love to know. Complaints about products, feature requests, product comparisons, and opinion mining from their social network are quite commonplace on Twitter. It’s quite stunning and eye opening to see that Twitter is evolving into a focus group that is ripe for dissection.

The ability to create strategic and highly targeted direct marketing campaigns is greatly increased as a result of the information that people volunteer about themselves and the products they buy. The next step for Repustate is to extract the value, organize it, and present it in such a way that the transition from discovery to action is as quick and seamless as possible.

What makes Google Instant so good (and the copy-cats not so good)?

There’s more to Google Instant than ajax. It’s the n-grams, stupid.

With the release of Google Instant, developers everywhere have sprung into action developing their own version of Instant for their favourite web service. This avalanche of development was undoubtedly spurred on by one developer’s job offer from YouTube based on his YouTube Instant work. But what many of the copy-cats fail to recognize is that an asynchronous results page is not what makes Google Instant effective; it’s their ability to predict your search. And how do they do that? N-grams my friend, n-grams.

What’s an n-gram? It’s simply a string of ‘n’ consecutive tokens. For example a tri-gram (or 3-gram) would be “how are you” or “las vegas vacation”. A bi-gram would be “hey there” or “justin bieber”. Google stores every single n-gram it can get its hands on (via Google Search, GMail, Google Voice, Google Voice Search etc.) in order to improve its accuracy in predicting a human’s search intent. In fact, Google published their list of most commonly used n-grams a while back based on their corpus.

So what does this all mean to you, the endeavouring developer? Repustate has added to its API a call to generate n-grams. With this call, you can generate your own n-grams over any data set you like or based on any web page you like and come up with your own prediction algorithms. For a quick demo, head over to our home page and using the demo box on the right, choose “Generate n-grams” from the drop down, enter a URL, and you’ll get a quick sampling of our n-gram generator.

Or better yet, sign up for our API (it’s free and provides unlimited use) and start having at it. If you have any ideas or suggestions, please let us know here.
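If you’d like to see the n-gram definition above in code before touching the API, here is a minimal local sketch. The `ngrams` function below is a hypothetical stand-in written for this post, not the Repustate API call, and it assumes tokens are simply whitespace-separated words.

```python
def ngrams(text, n):
    """Return every run of n consecutive whitespace-separated tokens in text."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = text.lower().split()
    # A text with fewer than n tokens yields an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams("las vegas vacation", 3))  # ['las vegas vacation']
print(ngrams("how are you today", 2))   # ['how are', 'are you', 'you today']
```

Splitting on whitespace is the simplest possible tokenizer; punctuation stays glued to words, so real text would need something smarter.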

Two new updates to our API

Repustate’s API welcomes a new member to the family this week.

Our API will see two new updates this week. The first is our ever popular “clean-html” API call. We’ve beefed it up to handle more cases and to be more resilient in handling odd web pages.

The next update is something we’re really happy about, and that is our newest API call – ngrams. An n-gram is a string of ‘n’ consecutive tokens. For example, a bi-gram is two tokens, such as “I like”. A tri-gram would be “I like Repustate”. You can count as high as you like, but in English, rarely do you go above 5-grams.

What’s the importance of n-grams? They let us see frequencies and commonalities which occur in written text, which of course is crucial to our cause. You know when you type in a Google search and it just happens to know what you’re looking for? That’s because it’s returning the most common n-grams people have typed. Google has the world’s largest collection of n-grams; Repustate is trying to get there!
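To show what “frequencies and commonalities” means in practice, here is a small sketch that counts how often each n-gram appears across a handful of documents. The helper names, the regex tokenizer and the sample sentences are assumptions made for illustration; this is not the implementation behind the ngrams API call.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(documents, n):
    """Count every n-gram (as a tuple of tokens) across all documents."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


# Made-up sample documents.
docs = ["I like Repustate.", "I like pizza.", "They like Repustate."]
counts = ngram_counts(docs, 2)
print(counts[("i", "like")])          # 2
print(counts[("like", "repustate")])  # 2
print(counts[("like", "pizza")])      # 1
```

Calling `counts.most_common(10)` on a large enough corpus is the crude version of “knowing what you’re looking for”.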

As usual, let us know what you want to see or what you don’t like. We listen to everyone.

Social media is failing, now what?

Is social media not the panacea people make it out to be? Or is it a case of “you’re doing it wrong”?

TechCrunch published an article[1] recently describing the results of a survey a German market research firm conducted. The survey found that by and large, social media projects (think Twitter campaigns, Facebook pages, attempts at viral videos on YouTube) fail. Now it’s not clear from the survey what “fail” means or which metrics were used to gauge success or failure, but I’m sure those in the industry can judge by their gut that for the most part, enterprise forays into social media have been busts.

There are some great comments on the TechCrunch site by some astute readers who, in some form or another, hit the nail on the head. Here’s a sampling of my favourites:

Martin Edic: “Most of this ‘marketing’ is a knee-jerk reaction to the question ‘why aren’t we in social media?’”

Jenni: “Reputation management is just one small branch of social media.”

Hazel Nieves: “Worst of all is the many ‘professionals’ in the roles of marketing and PR who have no clue on how to create and execute 21st century marketing. They are simply playing the role to keep their jobs.”

Alvin Tan: “Using social media as a broadcaster/megaphone is sub-optimal.”

Matthew: “I have yet to meet a social media whizz who can speak in depth about measurement, nor have I yet to meet a social media expert who has come from an engineering background, or who has been involved in any sort of actual architecting, delivery and running of platforms, or applications.”

Each of these comments sums up why Repustate started and why it’s doing things differently than others in the field. Here are the biggest problems from our point of view:

1) There is too much douche-baggery in the social media business. Everybody is a guru, yet nobody can produce quantifiable results to justify the title they’ve bestowed upon themselves.

2) The tools needed to measure social media effectiveness don’t exist yet (we’re working on it!). Imagine trying to analyze the effectiveness of your landing pages back in 1997 before Google Analytics. That’s where we are right now with social media measurement. Some companies are changing that [2], but we have a ways to go.

3) The current offerings to solve the above problems are so woeful it’s embarrassing. Basic sentiment analysis is being touted as the be-all-end-all. Being able to tell a brand that person X on Twitter just wrote something negative about them is useless. I’d want to know why it was negative. I’d want to know what I can do about it. In short, I want an actionable strategy. Imagine hiring an SEO consultant who after a couple thousand dollars came back and said, “OK, I’ve done the analysis, and your site is not optimized. Here’s my invoice.” You’d want to know *why* it wasn’t optimal. Are the title tags not relevant? Is the keyword density for your desired search terms too low? Is the markup poor?

It seems that, just like eCommerce in the bad old days prior to proper SEO tools, A/B testing and the like, social media is still waiting to grow out of its infancy. Repustate is aiming to give it the growth spurt it needs, but in the meantime, we all have to realize that social media is here to stay and an inability to extract value from it is a case of “you’re doing it wrong”.

[1] http://eu.techcrunch.com/2010/08/23/why-social-media-projects-fail-%E2%80%93-a-european-perspective/
[2] http://www.syncapse.com

Surprises in API usage

How a last-minute, indifferent decision led to our most popular API call.

Repustate’s mission statement is to become the world’s largest collection of natural language processing tools. To meet this challenge, we started out with a small set of API calls and are constantly adding and improving with each passing week. Internally, we developed a tool that extracted the most important text from any web page. If you visit any site today, there’s usually some menu at the top, footer links on the bottom, maybe some ads on one side, perhaps links to other articles on the other side, and the main article down the middle. Often when data mining, you want just what’s right down the middle, the heart of the article.

So we wrote a Python script to do this. On a whim, we decided to expose this through our API as well. Wouldn’t you know it, clean-html is our most popular API call – by far. In fact, about 60% of all of our API calls are to clean-html, which suits us just fine, but it’s kinda funny. A throwaway decision ended up being our most popular feature.

Just goes to show that one man’s simple, utilitarian API call is another man’s invaluable data processing tool. We’re trademarking that last sentence.
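For the curious, the general idea of pulling out “what’s right down the middle” can be sketched with nothing but Python’s standard library. To be clear, this is not the actual clean-html script. It is a rough illustration that assumes the page marks its menus, ads and footers with tags like `nav`, `aside` and `footer`, which plenty of real pages don’t. The class name, function name and tag list are all invented for this example.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is navigation, ads, or other clutter.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def clean_html(html):
    """Return the page text, one chunk per line, with the clutter removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<article><h1>Title</h1><p>Main story.</p></article>"
    "<aside>Ad</aside><footer>Contact</footer></body></html>"
)
print(clean_html(page))
```

On the sample page this keeps only the heading and the paragraph from the article. Pages that lay themselves out with anonymous `div`s would need heuristics well beyond this sketch, such as text density or link ratios.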

How I almost got to skip my exam and get an A+ (and maybe $1 million)

I coulda been famous, I coulda been a contender.

A PDF came across our Twitter feed today (http://www.scribd.com/doc/35539144/pnp12pt – courtesy of @arnaudsj) and it reminded me of my favourite moment in undergrad. For those who think the PDF is tl;dr – I’ll summarize quickly. The researcher who authored the paper believes he has found a proof to a very difficult problem in computer science for which there is a $1 million prize. For 30-40 years, the best minds in computer science and math have tried to solve this problem. Keep that in mind as I tell you my story.

In my 3rd year of computer science at York University in Toronto, my algorithms professor, Eric Ruppert, was teaching us about P and NP. He then told us about the $1 million prize for anyone who could either prove or disprove P == NP. He then threw this carrot out there: “And if anyone can solve it, they get an automatic A+ and get to skip the exam.” I was getting about a C+ at the time and did not look forward to the exam (with good reason, only 3 people passed it!).

So I went home that night and tried to solve the problem. I don’t remember the details, but I came up with an algorithm that ran in polynomial time and could minimize the time needed to traverse a graph (the Travelling Salesman problem). I was so excited, I emailed my professor telling him I’d be in his office the next day with good news.

The next day, I strode into his office, like Caesar returning from a victorious battle, expecting to accept my reward and for the crowds of computer science groupies to throw their bras at me. My professor laughed when I told him what I thought I had come up with, and began looking for a hole in my algorithm. A few minutes went by, and he still couldn’t find one. I saw that he began to get a little nervous. After what seemed like an hour, but was probably only 5 minutes, he *finally* came up with a scenario where my algorithm failed to find the correct solution and my dreams were dashed. No A+, no automatic deferral of the final exam, and no $1 million prize. But I almost reached the summit of algorithm proficiency. Alas, like George Costanza, I flew too close to the sun on wings of pastrami.

Why can’t marketers be quantified?

Is it too much to ask marketers to prove they know what they’re doing?

Being an engineer in a company full of engineers, I’ve grown accustomed to a particular world view. Specifically, that all things in life can and should be measured. Now of course, I don’t really think *all* things can be measured. I’m not sure if I can measure how much grief I went through when England was knocked out of the World Cup by Germany (days gone without smiling?), but when it comes to selling yourself and your services, if you can’t put up a number to back up your story, I raise an eyebrow.

In particular, the marketing industry draws my ire. The inaccuracies and casino-like nature of traditional forms of marketing and advertising are well documented (“Half of your marketing budget is wasted – you just don’t know which half”), but online marketing campaigns are different. There’s a slew of companies whose job is just to measure your marketing initiatives. Analytics, analytics, analytics. So in the face of our current mindset of measure, measure and then measure some more, why is it that so few marketers display their effectiveness?

What I’m angling for is a marketer (or “Guru” as the douch-ier ones like to be called on Twitter) to proudly display on his or her website: “I started a Twitter contest campaign for brand X and increased the number of followers for that brand by 25%, Facebook fans by 10% and regional revenue by $1.5M.” Is it really asking too much for people to be able to prove that they know what they’re talking about? I couldn’t imagine building a piece of software for a client and telling them, as I took their money, “You know, I’m not sure if this is even going to compile. Caveat emptor!”

Now I know those in the biz will counter and say “Yeah, but it’s hard to measure. How do we know that the increase in sales was a result of the Twitter campaign and not the weather or some other externality? How do we accurately recognize and attach revenue to any given marketing activity? Give us marketers a break, college boy!”

If I were a “social media marketing agency”, I would be damn sure to measure everything that I do for my clients. Why? Because when the next potential client comes along, I can increase my chances of landing business by proving my case with hard numbers. What currently passes for “proof” of competence is putting up the logo of a brand you have worked with. That’s garbage.

The only reason I can think of why marketers and their agency colleagues don’t show numbers is because there’s nothing to brag about. They know we’re all playing a game called the Emperor’s New Clothes and nobody has stood out yet and yelled, “You’re all naked!”

Well, you’re all naked.