Python requests encoding – using the Python requests module might give you surprising results
But there’s a subtle issue with regard to encodings that tripped us up. A customer told us that some Chinese web pages were coming back garbled when using the clean-html API call we provide. Here’s the URL: http://finance.sina.com.cn/china/20140208/111618150293.shtml
In the HTML on these pages, the charset is gb2312, an encoding that originated in China for the Simplified Chinese set of characters. However, many web servers do not send this charset in the response headers (the fault of the programmers, not the web server itself). When a text response’s Content-Type header doesn’t contain a charset, requests defaults to ISO-8859-1 as the encoding. This is done in accordance with RFC 2616. The upshot is that the Chinese text in the web page doesn’t get decoded properly when you access the decoded content of the response (response.text), and so what you see is garbled characters.
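To see the effect without touching the network, here’s a minimal sketch that uses only the standard library. The sample string is an assumption for illustration, not text taken from the page in question.

```python
# Sample Simplified Chinese text (an illustrative assumption, not from the page).
original = "中文网页"

# The bytes a server would send for a page encoded as gb2312.
raw_bytes = original.encode("gb2312")

# Decoding with the ISO-8859-1 fallback never raises an error,
# because every byte maps to some character. It just yields garbage.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with the charset the page actually uses recovers the text.
recovered = raw_bytes.decode("gb2312")

print(repr(garbled))
print(recovered)
```

Because ISO-8859-1 accepts any byte sequence, nothing fails loudly. You only notice when someone reads the output.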
Here are the response headers for the above URL:
    curl -I http://finance.sina.com.cn/china/20140208/111618150293.shtml
    HTTP/1.1 200 OK
    Content-Type: text/html
    Vary: Accept-Encoding
    X-Powered-By: schi_v1.02
    Server: nginx
    Date: Mon, 17 Feb 2014 15:54:28 GMT
    Last-Modified: Sat, 08 Feb 2014 03:56:49 GMT
    Expires: Mon, 17 Feb 2014 15:56:28 GMT
    Cache-Control: max-age=120
    Content-Length: 133944
    X-Cache: HIT from 236-41.D07071951.sina.com.cn
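Note that the Content-Type header is just text/html, with no charset parameter. The fallback rule can be sketched roughly as follows. This is an illustration of the RFC 2616 default using the standard library’s header parser, not the code requests itself runs, and the function name encoding_from_content_type is made up for this example.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Return the charset a client following RFC 2616 would assume.

    Hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = content_type

    # An explicit charset parameter always wins.
    charset = msg.get_content_charset()
    if charset:
        return charset

    # RFC 2616: text/* media types without a charset default to ISO-8859-1.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"

    # For anything else, make no assumption.
    return None


print(encoding_from_content_type("text/html"))
print(encoding_from_content_type("text/html; charset=gb2312"))
```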
There is a thread on the GitHub repository for requests that explains why they do this – requests shouldn’t be about HTML, the argument goes, it’s about HTTP, so if a server doesn’t respond with the proper charset declaration, it’s up to the client (or the developer) to figure out what to do. That’s a reasonable position to take, but it poses an interesting question: when “common” use or expectations go against the official spec, whose side does one take? Do you tell developers to put on their big boy and girl pants and deal with it, or do you acquiesce and just do what most people expect/want?
Specs be damned, make it easy for people
I believe it was Alex Payne, Twitter’s API lead at the time, who was asked why Twitter includes the version of the API in the URL rather than in the request header, as is more RESTful. His paraphrased response (because I can’t find the quote) was that Twitter’s goal was to get as many people using the API as possible, and setting headers was beyond the skill level of many developers, whereas including the version in the URL is dead simple. (We at Repustate do the same thing; our APIs are versioned via the URL. It’s simpler and more transparent.)
Now the odd thing about requests is that the response object has an attribute called apparent_encoding which guesses the charset based on the content of the response, and for these pages it guesses correctly. It’s just not automatically applied, because the encoding derived from the response headers (including the ISO-8859-1 fallback) takes precedence.
We ended up patching requests so that the apparent_encoding attribute is what gets used when the response headers don’t specify a charset, but this is not the default behaviour of the package.
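If you’d rather not patch the library, the same idea can be applied at the call site. What follows is a minimal sketch, not our actual patch: the helper name text_with_detected_charset is made up, and it assumes it is handed a requests Response object, relying only on its headers, encoding, apparent_encoding and text attributes.

```python
def text_with_detected_charset(response):
    """Return the decoded body, preferring a detected charset when the
    Content-Type header does not declare one.

    Hypothetical helper; `response` is assumed to be a requests Response.
    """
    content_type = response.headers.get("Content-Type", "")

    if "charset=" not in content_type.lower():
        # No declared charset, so requests would use its header-based
        # fallback. Use the charset guessed from the body instead.
        guessed = response.apparent_encoding
        if guessed:  # detection can come back empty
            response.encoding = guessed

    # response.text decodes the body using response.encoding.
    return response.text


# Usage with the real library (requires requests to be installed):
#   import requests
#   html = text_with_detected_charset(requests.get(url, timeout=10))
```

Keep in mind that apparent_encoding is a guess from a character detection library, so it can be wrong, particularly on short responses. When a server does declare a charset, this sketch leaves it alone.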
I can’t say I necessarily disagree with the choices the maintainers of requests have made. I’m not sure if there is a right answer because if you write your code to be user friendly in direct opposition to a published spec, you will almost certainly raise the ire of someone who *does* expect things to work to spec. Damned if you do, damned if you don’t.