Using python requests – whose encoding is it anyway?

Python requests encoding – using the Python requests module might give you surprising results

If you do anything remotely related to HTTP and use Python, then you’re probably using requests, the amazing library by Kenneth Reitz. It’s to Python HTTP programming what jQuery is to Javascript DOM manipulation – once you use it, you wonder how you ever did without it.

But there’s a subtle issue with regards to encodings that tripped us up. A customer told us that some Chinese web pages were coming back garbled when using the clean-html API call we provide. Here’s the URL:

http://finance.sina.com.cn/china/20140208/111618150293.shtml

In the HTML on these pages, the charset is gb2312 which is an encoding that came out of China used for the Simplified Chinese set of characters. However, many web servers do not send this as the charset in the response headers (due to the programmers, not the web server itself). As a result, requests defaults to ISO 8851-9 as the encoding when the response doesn’t contain a charset. This is done in accordance with RFC 2616. The upshot is that the Chinese text in the web page doesn’t get encoded properly when you access the encoded content of the response and so what you see is garbled characters.

Here’s the response headers for the above URL:

curl -I http://finance.sina.com.cn/china/20140208/111618150293.shtml
HTTP/1.1 200 OK

Content-Type: text/html
Vary: Accept-Encoding
X-Powered-By: schi_v1.02
Server: nginx
Date: Mon, 17 Feb 2014 15:54:28 GMT
Last-Modified: Sat, 08 Feb 2014 03:56:49 GMT
Expires: Mon, 17 Feb 2014 15:56:28 GMT
Cache-Control: max-age=120
Content-Length: 133944
X-Cache: HIT from 236-41.D07071951.sina.com.cn

There is a thread on the Github repository for requests that explains why they do this – requests shouldn’t be about HTML, the argument goes, it’s about HTTP so if a server doesn’t respond with the proper charset declaration, it’s up to the client (or the developer) to figure out what to do. That’s a reasonable position to take, but it poses an interesting question: When “common” use or expectations, go against official spec, whose side does one take? Do you tell developers to put on their big boy and girl pants and deal with it or do you acquiesce and just do what most people expect/want?

Specs be damned, make it easy for people

I believe it was former Twitter API lead at the time, Alex Payne, who was asked why does Twitter include the version of the API in the URL rather than in the request header, as is more RESTful. His paraphrased response (because I can’t find the quote) is that Twitter’s goal was to get as many people using the API as possible and settings headers was beyond the skill level of many developers, whereas including it in the URL is dead simple. (We at Repustate do the same thing; our APIs are versioned via the URL. It’s simpler and more transparent.)

Now the odd thing about requests is that the package has an attribute called apparent_encoding which does correctly guess the charset based on the content of the response. It’s just not automatically applied because the response header takes precedence.

We ended up patching requests so that the apparent_encoding attribute is what gets used in the case no headers are set by default, but this is not the default behaviour of the package.

I can’t say I necessarily disagree with the choices the maintainers of requests have made. I’m not sure if there is a right answer because if you write your code to be user friendly in direct opposition to a published spec, you will almost certainly raise the ire of someone who *does* expect things to work to spec. Damned if you do, damned if you don’t.

Social sentiment analysis is missing something

Sentiment analysis by itself doesn’t suffice

An interesting article by Seth Grimes caught our eye this week. Seth is one of the few voices of reason in the world of text analytics that I feel “gets it”. His views on sentiment’s strengths and weaknesses, advantages and shortcomings align quite perfectly with Repustate’s general philosophy.

In the article, Seth states that simply getting relying on a number denoting sentiment or a label like “positive” or “negative” is too coarse a measurement and doesn’t carry any meaning with it. By doing so, you risk overlooking deeper insights that are hidden beneath the high level sentiment score. Couldn’t agree more with this and that’s why Repustate supports categorizations.

Sentiment by itself is meaningless; sentiment analysis scoped to a particular business need or product feature etc. is where true value lies. Categorizing your social data by features of your service (e.g. price, selection, quality) first and THEN applying sentiment analysis is the way to go. In the article, Seth proceeds to list a few “emotional” ones (promoter/detractor, angry, happy etc). that quite frankly I would ignore. These categories are too touchy-feely, hard to really disambiguate at a machine learning level and don’t tie closely to actual business processes/features. For instance, if someone is a detractor, what is that is causing them to be a detractor? Was it the service they received? If so, then customer service is the category you want and negative polarity of the text in question gives you invaluable insights. The fact that someone is being negative about your business means almost by definition they are detractors.

Repustate provides our customers with the ability to create their own categories according to the definitions that they create. Each customer is different, each business is different, hence the need for customized categories. Once you have your categories, sentiment analysis becomes much more insightful and valuable to your business.