3.5 Years, 500k Lines of Go (Part 1)

Mar 24, 2017

January 31st 2017 was my last day at Canonical, after working for 3.5 years on what is one of the largest open source projects written in Go - Juju.

As of this writing, the main repo for Juju, http://github.com/juju/juju, is 3542 files, with 540,000 lines of Go code (not included in that number is 65,000 lines of comments). Counting all dependencies except the standard library, Juju is 9523 files, holding 1,963,000 lines of Go code (not including comments, which clock in at 331,000 lines).

These are a few of my lessons learned from my roughly 7000 hours working on this project.

Notably, not everyone on the Juju team would agree with all of these, and the codebase was so huge that you could work for a year and not see 2/3rds of the codebase. So take the following with a grain of salt.

About Juju

Juju is service orchestration tool, akin to Nomad or Kubernetes and similar tools. Juju consists (for the most part) of exactly two binaries: a client and a server. The server can run in a few different modes (it used to be multiple binaries, but they were 99% the same code, so it was easier to just make one binary that can be shipped around). The server runs on a machine in the cloud of your choice, and copies of the binary are installed on new machines in the cloud so they can be controlled by the central server. The client and the auxiliary machines talk to the main server via RPC over websockets.

Juju is a monolith. There are no microservices, everything runs in a single binary. This actually works fairly well, since Go is so highly concurrent, there’s no need to worry about any one goroutine blocking anything else. It makes it convenient to have everything in the same process. You avoid serialization and other interprocess communication overhead. It does lend itself to making code more interdependent, and separations of concerns was not always the highest priority. However, in the end, I think it was much easier to develop and test a monolith than it would have been if it were a bunch of smaller services, and proper layering of code and encapsulation can help a lot with spaghetti code.

Package Management

Juju did not use vendoring. I think we should have, but the project was started before any of the major vendoring tools were out there, and switching never felt like it was worth the investment of time. Now, we did use Roger Peppe’s godeps (not the same as godep btw) to pin revisions. The problem is that it messes with other repos in your GOPATH, setting them to a specific commit hash, so if you ever go to build something else that doesn’t use vendoring, you’d be building from a non-master branch. However, the revision pinning gave us repeatable builds (so long as no one did anything truly heinous to their repo), and it was basically a non-issue except that the file that holds the commit hashes was continually a point of merge conflicts. Since it changed so often, by so many developers, it was bound to happen that two people change the same or adjacent lines in the file. It became such a problem I started working on an automatic resolution tool (since godeps holds the commit date of the hash you’re pinning, you could almost always just pick the newer hash). This is still a problem with glide and any similar tool that stores dependency hashes in a single file. I’m not entirely sure how to fix it.

Overall, I never felt that package management was a huge issue. It was a minor thing in our day to day work… which is why I always thought it was weird to read all the stories about people rejecting Go because of lack of package management solutions. Because most third party repos maintained stable APIs for the same repo, and we could pin our code to use a specific commit… it just was not an issue.

Project Organization

Juju is 80% monorepo (at github.com/juju/juju, with about 20% code that exists in separate repos (under github.com/juju). The monorepo section has pros and cons… It is easy to do sweeping changes across the codebase, but it also means that it doesn’t feel like you need to maintain a stable API in foo/bar/baz/bat/alt/special … so we didn’t. And that means that it would be essentially insane for anyone to actually import any package from under the main monorepo and expect it to continue to exist in any meaningful way at any future date. Vendoring would save you, but if you ever needed to update, good luck.

The monorepo also meant that we were less careful about APIs, less careful about separation of concerns, and the code was more interdependent than it possibly could have been. Not to say we were careless, but I feel like things outside the main Juju repo were held to a higher standard as far as separation of concerns and the quality and stability of the APIs. Certainly the documentation for external repos was better, and that might be enough of a determining factor by itself.

The problem with external repos was package management and keeping changes synchronized across repos. If you updated an external repo, you needed to then check in changes to the monorepo to take advantage of that. Of course, there’s no way to make that atomic across two github repos. And sometimes the change to the monorepo would get blocked by code reviews or failing tests or whatever, then you have potentially incompatible changes sitting in an external repo, ready to trip up anyone who might decide to make their own changes to the external repo.

The one thing I will say is that utils repos are nefarious. Many times we’d want to backport a fix in some subpackage of our utils repo to an earlier version of Juju, only to realize that many many other unrelated changes get pulled along with that fix, because we have so much stuff in the same repo. Thus we’d have to do some heinous branching and cherry picking and copypasta, and it’s bad and don’t do it. Just say no to utils packages and repos.

Overall Simplicity

Go’s simplicity was definitely a major factor in the success of the Juju project. Only about one third of the developers we hired had worked with Go before. The rest were brand new. After a week, most were perfectly proficient. The size and complexity of the product were a much bigger problem for developers than the language itself. There were still some times when the more experienced Go developers on the team would get questions about the best way to do X in Go, but it was fairly rare. Contrast this to my job before working on C#, where I was constantly explaining different parts of the language or why something works one way and not another way.

This was a boon to the project in that we could hire good developers in general, not just those who had experience in the language. And it meant that the language was never a barrier to jumping into a new part of the code. Juju was huge enough that no one person could know the fine details of the whole thing. But just about anyone could jump into a part of the code and figure out what 100 or so lines of code surrounding a bug were supposed to do, and how they were doing it (more or less). Most of the problems with learning a new part of the code were the same as it would have been in any language - what is the architecture, how is information passed around, what are the expectations.

Because Go has so little magic, I think this was easier than it would have been in other languages. You don’t have the magic that other languages have that can make seemingly simple lines of code have unexpected functionality. You never have to ask “how does this work?”, because it’s just plain old Go code. Which is not to say that there isn’t still a lot of complex code with a lot of cognitive overhead and hidden expectations and preconditions… but it’s at least not intentionally hidden behind language features that obscure the basic workings of the code.

Testing

Test Suites

In Juju we used Gustavo Nieyemer’s gocheck to run our tests. Gocheck’s test suite style encouraged full stack testing by reducing the developer overhead for spinning up a full Juju server and mongo database before each test. Once that code was written, as huge as it was, you could just embed that “base suite” in your test suite struct, and it would automatically do all the dirty work for you. This meant that our unit tests took almost 20 minutes to run even on a high end laptop, because they were doing so much for each test. It also made them brittle (because they were running so much code) and hard to understand and debug. To understand why a test was passing or failing, you had to understand all the code that ran before the open brace of your test function, and because it was easy to embed a suite within a suite, there was often a LOT that ran before that open brace.

In the future, I would stick with the standard library for testing instead. I like the fact that test with the standard library are written just like normal go code, and I like how explicit the dependencies have to be. If you want to run code at the beginning of your test, you can just put a method there… but you have to put a method there.

`time` in a bottle

The time package is the bane of tests and testable code. If you have code that times out after 30 seconds, how do you test it? Do you make a test that takes 30 seconds to run? Do the rest of the tests take 30 seconds to run if something goes wrong? This isn’t just related to time.Sleep but time.After or time.Ticker…. it’s all a disaster during tests. And not to mention that test code (especially when run under -race) can go a lot slower than your code does in production.

The cure is to mock out time… which of course is non-trivial because the time package is just a bunch of top level functions. So everywhere that was using the time package now needs to take your special clock interface that wraps time and then for tests you pass in a fake time that you can control. This tooks us a long time pull the trigger on and longer still to propagate the changes throughout our code. For a long time it was a constant source of flakey tests. Tests that would pass most of the time, but if the CI machine were slow that day, some random test would fail. And when you have hundreds of thousands of lines of tests, chances are SOMETHING is going to fail, and chances are it’s not the same thing as what failed last time. Fixing flakey tests was a constant game of whack-a-mole.

Cross Compilation Bliss

I don’t have the exact number of combinations, but the Juju server was built to run on Windows and Linux (Centos and Ubuntu), and across many more architectures than just amd64, including some wacky ones like ppc64le, arm64, and s390x.

In the beginning, Juju used gccgo for builds that the gc compiler did not support. This was a source of a few bugs in Juju, where gccgo did something subtly wacky. When gc was updated to support all architectures, we were very happy to leave the extra compiler by the wayside and be able to work with just gc.

Once we switched to gc, there were basically zero architecture-specific bugs. This is pretty awesome, given the breadth of architectures Juju supported, and the fact that usually the people using the wackier ones were big companies that had a lot of leverage with Canonical.

Multi-OS Mistakes

In the beginning when we were ramping up Windows support, there were a few OS specific bugs (we all developed on Ubuntu, and so Windows bugs often didn’t get caught until CI ran). They basically boiled down to two common mistakes related to filesystems.

The first was assuming forward slashes for paths in tests. So, for example, if you know that a config file should be in the “juju” subfolder and called “config.yml”, then your test might check that the file’s path is folder + “/juju/config.yml” - except that on Windows it would be folder + “\juju\config.yml”.

When making a new path, even in tests, use filepath.Join, not path.Join and definitely not by concatenating strings and slashes. filepath.Join will do the right thing with slashes for the OS. For comparing paths, always use path.ToSlash to convert a filepath to a canonical string that you can then compare to.

The other common mistake was for linux developers to assume you can delete/move a file while it’s open. This doesn’t work on Windows, because Windows locks the file when it’s open. This often came in the form of a defer file.Delete() call, which would get FIFO’d before the deferred file.Close() call, and thus would try to delete the file while it was still open. Oops. One fix is to just always call file.Close() before doing a move or delete. Note that you can call Close multiple times on a file, so this is safe to do even if you also have a defer file.Close() that’ll fire at the end of the function.

None of these were difficult bugs, and I credit the strong cross platform support of the stdlib for making it so easy to write cross platform code.

Error Handling

Go’s error handling has definitely been a boon to the stability of Juju. The fact that you can tell where any specific function may fail makes it a lot easier to write code that expects to fail and does so gracefully.

For a long time, Juju just used the standard errors package from the stdlib. However, we felt like we really wanted more context to better trace the path of the code that caused the error, and we thought it would be nice to keep more detail about an error while being able to add context to it (for example, using fmt.Errorf losing the information from the original error, like if it was an os.NotFound error).

A couple years ago we went about designing an errors package to capture more context without losing the original error information. After a lot of bikeshedding and back and forth, we consolidated our ideas in https://github.com/juju/errors. It’s not a perfect library, and it has grown bloated with functions over the years, but it was a good start.

The main problem is that it requires you to always call errors.Trace(err) when returning an error to grab the current file and line number to produce a stack-trace like thing. These days I would choose Dave Cheney’s github.com/pkg/errors, which grabs a stack trace at creation time and avoid all the tracing. To be honest, I haven’t found stack traces in errors to be super useful. In practice, unforeseen errors still have enough context just from fmt.Errorf(“while doing foo: %v”, err) that you don’t really need a stack trace most of the time. Being able to investigate properties of the original error can sometimes come in handy, though probably not as often as you think. If foobar.Init() returns something that’s an os.IsNotFound, is there really anything your code can do about it? Most of the time, no.

Stability

For a huge project, Juju is very stable (which is not to say that it didn’t have plenty of bugs… I just mean it almost never crashed or grossly malfunctioned). I think a lot of that comes from the language. The company where I worked before Canonical had a million line C# codebase, and it would crash with null reference exceptions and unhandled exceptions of various sorts fairly often. I honestly don’t think I ever saw a nil pointer panic from production Juju code, and only occasionally when I was doing something really dumb in brand new code during development.

I credit this to go’s pattern of using multiple returns to indicate errors. The foo, err := pattern and always always checking errors really makes for very few nil pointers being passed around. Checking an error before accessing the other variable(s) returned is a basic tenet of Go, so much so that we document the exceptions to the rule. The extra error return value cannot be ignored or forgotten thanks to unused variable checks at compile time. This makes the problem of nil pointers in Go fairly well mitigated, compared to other similar languages.

Generics

I’m going to make this section short, because, well, you know. Only once or twice did I ever personally feel like I missed having generics while working on Juju. I don’t remember ever doing a code review and wishing for generics for someone else’s code. I was mostly happy not to have to grok the cognitive complexity I’d come to be familiar with in C# with generics. Interfaces are good enough 99% of the time. And I don’t mean interface{}. We used interface{} rarely in Juju, and almost always it was because some sort of serialization was going on.

Next Time

This is already a pretty long post, so I think I’ll cap it here. I have a lot of more specific things that I can talk about… about APIs, versioning, the database, refactoring, logging, idioms, code reviews, etc.

This is a post in the 3.5 Years of Go series.
Other posts in this series:

Mar 24, 2017 - 3.5 Years, 500k Lines of Go (Part 1)

npf.io