This is a (long) blog post about our experience at Repustate in migrating a big chunk of code from Python/Cython to Go. If you want to read the whole story, background and all, read on. If you’re interested in just what Python developers need to know before taking the plunge, click the link below.
Tips & tricks in migrating from Python to Go.
One of the best technological feats that we’ve done here at Repustate was implementing Arabic sentiment analysis. Arabic is one tough nut to crack because of the complex morphological forms Arabic words can take. Tokenization (splitting a sentence up into individual words) is also tougher in Arabic than in say, English, because Arabic words can contain whitespace within the word itself (e.g. the position of ‘aleph’ within a word). Without giving away our secret recipe, Repustate uses support vector machines (SVM) to come up with the most likely meaning behind a sentence and then apply sentiment to that. In total, we use 22 models (i.e 22 SVMs) and each word in a document is analyzed. So if you have 500 words in a document, that’s more than 10,000 comparisons against the SVMs.
Repustate is almost entirely a Python shop; we use Django for the API and website. So it only made sense (at the time) to keep the code base homogenous and implement all of the Arabic sentiment engine in Python as well. As far as prototyping and implementing goes, Python is hard to beat. Very expressive, awesome 3rd party libraries etc. If you’re serving up web pages, it’s perfect. But when you’re doing low level computations, doing lots of comparisons against hash tables (dictionaries in Python), things get slow. We were able to process about 2-3 Arabic documents per second, which is too slow. By comparison, our English language sentiment engine can do about 500 per second.
So we fired up the Python profiler and began investigating what was taking so long. Remember above how I said we have 22 SVMs and each word passes through it? Well that was all done in serial, not in parallel. OK, our first thought was to change to this to a map/reduce like operator. TL;DR: The map/reduce idiom stinks in Python. When you need concurrency, Python is just not your friend. At PyCon 2013, Guido spoke about Tulip, his new project that was hoping to remedy this, but that’s not due out for a while, and why wait there’s already something better out there.
Golang or go home
My friend at Mozilla told me that Mozilla Services was switching over to Go for much of their logging infrastructure, in part because of the awesomeness of goroutines. Go was designed by the folks at Google and it was designed with concurrency as a first-class notion, not an afterthought, as Python’s various solutions are. So we went about making the change from Python to Go.
While the Go code is not yet in production, the results are ridiculously encouraging. We’re doing 1000 documents/s now, using WAY less memory, and not having to debug ugly multiprocess/gevent/”why won’t Control-C kill my process” code that you get in Python.
Why we love Go
Anyone who has a bit of an understanding of how programming languages work (interpreted vs. compiled, dynamic vs. static) will say, “Well duh, obviously Go is faster”. Yeah, we could have re-written the whole thing in Java and seen similar improvements, but that’s not why Go is such a winner. The code you write with Go just seems to be correct. I can’t really put my finger on it, but somehow once the code compiled (and it compiles QUICKLY), you just get the feeling that it’ll work (not just run without error, but even logically be correct). I know, that sounds very wishy-washy, but it’s true. It’s very similar to Python in terms of verbosity (or lack thereof) and it treats functions as first-class objects, so functional programming is easy to reason about. And of course, goroutines and channels make your life so much easier. So you get the performance boost of static typing and having finer control over memory allocation but you don’t forfeit too much in expressiveness.
Things we wish we knew
With all the compliments out of the way, you really do need a different mindset at times when dealing with Go compared to Python. So here’s a list of notes I kept as the migration took place – just random things that popped into my head when converting Python code to Go:
- No built-in type for sets (have to use maps and test for existence)
- In absence of sets, have to write your own intersection, union etc. methods
- No tuples, have to write your own structs or use slices (arrays)
- No __getattr__() like method, so you have to always check for existence rather than setting defaults e.g. in Python you can do value = dict.get(“a_key”, “default_value”)
- Having to always check errors (or at least explicitly ignore them)
- Can’t have variables/packages that aren’t used so to test simple things requires sometimes commenting out lines
- Going between byte and string. regexp uses byte (they’re mutable). It makes sense, but it’s annoying all the same having to cast & re-cast some variables.
- Python is more forgiving. You can take slices of strings using indexes that are out of range and it won’t complain. You can take negative slices – not Go.
- You can’t have mixed type data structures. Maybe it’s not kosher, but sometimes in Python I’ll have a dictionary where the values are a mix of strings and lists. Not in Go, you have to either clean up your data structures or define custom structs
- No unpacking of a tuple or list into separate variables (e.g. x,y,x = [1,2,3])
- UpperCamelCase is the convention (if you don’t have a title case on the function/struct in a package it won’t be exposed to other packages). I like Python’s lower_case_with_underscores more.
- Have to explicitly check if errors are != nil, unlike in Python where many types can be used for bool-like checks (0, “”, None can all be interpreted as being “not” set)
- Documentation on some modules (e.g. crypto/md5) is sparse BUT go-nuts on IRC is awesome, really great support available
- Type casting from number to string (int64 -> string) is different than going from byte -> string (just use string(byte)). Need to use strconv.
- Reading Go code is definitely more like a programming language whereas Python can be written as almost pseudocode. Go has more non-alphanumeric characters and uses || and && instead of “or” and “and”.
- Writing to a file, there’s File.Write(byte) and File.WriteString(string) – a bit of a departure for Python developers who are used to the Python zen of having one way to do something
- String interpolation is awkward, have to resort to fmt.Sprintf a lot
- No constructors, so common idiom is to create NewType() functions that return the struct you want
- Else (or else if) has to be formatted properly, where the else is on the same line as the curly bracket from the if clause. Weird.
- Different assignment operator is used depending on whether you are inside & outside of function ie. = vs :=
- If I want a list of just the keys or just the value, as in dict.keys() or dict.values(), or a list of tuples like in dict.items(), there is no equivalent in Go, you have to iterate over maps yourself and build up your list
- I use an idiom at times of having a dictionary where the values are functions that I want to invoke given a key. You can do this in Go, but all functions have to accept & return the same thing i.e. have the same method signature
- If you’re using JSON and your JSON is a mix of types, goooooood luck. You’ll have to create a custom struct that matches the format of your JSON blob, and then Unmarshall the raw json into an instance of your custom struct. Much more work than just obj = json.loads(json_blob) like we’re used to in Python land.
Was it worth it?
Yes, a million times, yes. The speed boost is just too good to pass up. Also, and this counts for something I think, Go is a trendy language right now, so when it comes to recruiting, I think having Go as a critical part of Repustate’s tech stack will help.