The Hudson Build Farm Experience, Volume IV


February 5, 2009 By John Casey

In Progress: The Learning Curve We’re Still Climbing

Now that we’ve covered the high points of our Hudson build farm setup here at Sonatype, I want to discuss some of the issues we’re currently facing. It’s important to realize that providing high-quality continuous integration is a long, involved process…not a quick, one-off event. Sure, you can get Hudson up and running fairly rapidly in a non-distributed environment. However, the path to distributed, multi-OS builds that capture a full range of testing can be very, very complex. In the end, if you can get by simply compensating for the problems I talked about in this series of posts, then you’re probably pretty lucky. Here at Sonatype, we’re very conscious of the fact that our continuous integration setup could run more smoothly, and we continue to chip away at the list of things we’d like Hudson to verify automatically on our behalf. So, in the interests of full disclosure, I’m including a short wish list of items we’re currently working on.

1. Cross-OS Path Translation

I mentioned this previously, in my second build-farm post. Currently, there is very little ability to manage the Java versions used in a build orchestrated from a Linux Hudson master instance and executed on a Windows Hudson slave instance. This is a particular problem in cases such as the Maven bootstrapping process, where one Java process spawns another. In the case of the Maven bootstrap, an Ant build produces a crude version of the Maven binaries that is then used to run the real Maven build, which in turn is used to run the Maven core integration tests. Yes, this is a very complex, involved way of doing things…but it has the advantage of being pretty self-contained, and it checks a whole lot of assumptions about Maven and your build environment in the process. Since this is such a demanding build, it’s perhaps inevitable that it will act like a canary in the mineshaft for problems such as an invalid JAVA_HOME environment variable passed from a Linux-based orchestration system to a Windows-based execution system. As you might expect, all of the paths passed from one to the other are completely wrong. In fact, the only way to run the Maven bootstrap on this sort of build farm is to use the default JAVA_HOME envar set up in the slave startup process. This startup process is necessarily different for *nix and Windows Hudson slaves, so you can take advantage of this and cheat the build in a way.

However, when the Hudson slave instance uses Java 1.5 and you want to verify that the Maven codebase doesn’t accidentally include Java5-specific APIs (yes, this is a common problem), you need the ability to force the Maven bootstrap to use JDK 1.4 instead. Currently, this is impossible. One reasonable solution might be to allow each environment to define a custom envar – say, $JAVA14_HOME – then use that in the definition for a Java version in Hudson. This way, each Hudson slave could interpret this envar locally and have access to the correct path for that JDK. Fortunately, we may have some help on the way in the form of a patch posted against Hudson bug #2918 that implements exactly this sort of solution. Tom, please rest assured that I’m very excited about your patch, and I’m going to resume efforts to test it and get it deployed in our farm ASAP once I’m done with this post!
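
To make the idea concrete, here is a minimal sketch of per-slave envar resolution. It is not Hudson code and not the patch from bug #2918; the function name `resolve_jdk_home`, the example environments, and the JDK paths are all made up for illustration. The point is only that one shared definition, `$JAVA14_HOME`, can yield a different, locally valid path on each slave.

```python
from string import Template


def resolve_jdk_home(definition, slave_env):
    """Expand envar references in a shared JDK definition using one slave's environment.

    `definition` is the single value configured on the master (hypothetical),
    `slave_env` is that slave's own environment as a dict.
    """
    try:
        return Template(definition).substitute(slave_env)
    except KeyError as missing:
        # The slave never defined the envar, so fail loudly instead of
        # silently falling back to a path that belongs to another OS.
        raise LookupError(f"slave does not define {missing}") from None


# Hypothetical slave environments; the paths are examples only.
linux_env = {"JAVA14_HOME": "/opt/jdk1.4"}
windows_env = {"JAVA14_HOME": r"C:\jdk1.4"}

for name, env in (("linux", linux_env), ("windows", windows_env)):
    print(name, resolve_jdk_home("$JAVA14_HOME", env))
```

The master never needs to know either path; it only needs to know the name of the envar that every slave agrees to define.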

2. Management of Related Builds

Currently, we have multiple flavors of build happening out on the build farm for a single project – Nexus, for example. We have builds for the 1.2.x branch of Nexus, and others for the trunk (currently progressing towards 1.3). In addition, we’re interested in smoke tests for these builds, since integration testing can take quite a while…so, we have one build run for each branch that simply builds the project including unit tests, and another that builds the branch with integration tests enabled, to satisfy our longer-term concern that all use cases remain satisfied. The result is a proliferation of jobs out on the build farm, each of which executes on all available operating systems. This means that each build fills one executor on each slave, and that changes in both branches (such as a common configuration that changes and gets merged across from one branch to the other) can easily fill up all available executor slots, clogging the entire build farm for a significant period of time.

Clogging is bad enough. However, to strike a balance between the proliferation of different build types in Hudson and the disk space consumed by separate local repositories used to provide isolation between jobs, we’re actually using the same local repository location for all Nexus builds. This means we run the risk of having two (or more) Nexus builds run over one another in terms of artifact downloads, which produces phantom build failures (it’s effectively a race condition that has nothing to do with the code and everything to do with the build environment). In order to avoid this situation, we’re using the Locks and Latches Hudson plugin to ensure that only one of these related builds runs at a time. This seems to be an effective solution to the problems associated with concurrent builds using the same local repository.

Except now we’re left wondering, “at what cost?” since Locks and Latches seems not to be designed for a distributed environment. Or, if it is, it’s meant to control access to some external resource, and we’re misusing it…but relying on external resources for testing is generally a bad idea, so…well, anyway, I digress. In any event, Locks and Latches giveth, and Locks and Latches taketh away. What we gain in terms of build stability from this plugin we lose in terms of execution time for the job as a whole. Locks and Latches constrains all jobs using a particular lock, so that only one can execute at a time. The problem is, the lock is shared for all relevant jobs on all slaves, which means that each Nexus build will only build on one slave at a time, and won’t run as long as another Nexus build is running. The net effect is to take a fairly long build process and make it unbelievably longer, to the point where it’s unclear that we actually gain any benefit by defining separate integration-test and non-integration-test builds…they all wait on the same lock, so the extra job definitions may actually have the opposite effect in some cases.

A better solution would allow a job’s myriad slave executions to run in parallel – why not, they’re on separate machines after all – but restrict jobs sharing a particular lock to run serially with respect to one another. In short, exactly what the Locks and Latches plugin does, except instead of applying the locking mechanism after Hudson distributes the job to slave instances and clogs their executors, it would apply to the master build queue ahead of distributed queueing. This is on my TODO list as a bug to file in Hudson, but if anyone has a better solution I’m all ears…please let me know!
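
Here is a rough sketch of the queueing behavior I have in mind. It is a toy model, not Hudson or Locks and Latches code: the `MasterQueue` class, the job names, and the lock name are all hypothetical, and it ignores executor capacity entirely. The lock is checked on the master before anything is handed to a slave, so a waiting job never occupies an executor, while a job that does hold the lock fans out to every slave at once.

```python
from collections import deque


class MasterQueue:
    """Toy model: locks are checked on the master, before distribution."""

    def __init__(self, slaves):
        self.slaves = list(slaves)
        self.pending = deque()   # (job, lock) pairs in submission order
        self.held_locks = {}     # lock name -> job currently holding it

    def submit(self, job, lock=None):
        self.pending.append((job, lock))

    def dispatch(self):
        """Start every queued job whose lock is free; return (job, slave) pairs."""
        started = []
        still_waiting = deque()
        for job, lock in self.pending:
            if lock is not None and lock in self.held_locks:
                # Blocked jobs stay on the master and use no slave executors.
                still_waiting.append((job, lock))
                continue
            if lock is not None:
                self.held_locks[lock] = job
            # The job runs on all slaves in parallel.
            started.extend((job, slave) for slave in self.slaves)
        self.pending = still_waiting
        return started

    def finish(self, job):
        """Release any lock held by a job that has completed on all slaves."""
        for lock, holder in list(self.held_locks.items()):
            if holder == job:
                del self.held_locks[lock]


queue = MasterQueue(["linux", "windows"])
queue.submit("nexus-trunk", lock="nexus-repo")
queue.submit("nexus-1.2.x", lock="nexus-repo")
queue.submit("unrelated-job")

first_wave = queue.dispatch()    # nexus-trunk and unrelated-job, on both slaves
queue.finish("nexus-trunk")
second_wave = queue.dispatch()   # nexus-1.2.x, on both slaves
```

In this model the two Nexus builds still never overlap, so the shared local repository stays safe, but each one gets the whole farm while it runs.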

3. Builds Requiring Graphical Environments

We have some builds in our portfolio – notably, m2eclipse and some other Flex-based projects – that require a graphical environment in order to run their tests. It’s simply the nature of these projects that they can only execute in a graphical environment. For example, m2eclipse is a plugin for Eclipse, and it doesn’t really make much sense to run Eclipse in a headless environment. For builds, sure; for tests (and the use-case execution that implies), no.

In our legacy environment, we have Xvfb set up for m2eclipse and other projects, but at some point we’re going to have to learn how to migrate these builds over to our farm. After all, these projects have the same need to be exercised and tested on any operating system on which we expect them to work. For the most part, this should be a relatively straightforward – if not easy – task. Setting up Xvfb on *nix machines, including Solaris, OS X, and Linux, shouldn’t be too difficult in theory. There’s even a Hudson plugin that can manage an Xvnc process for you, starting it before the build and shutting it down afterwards. So, can you guess why we haven’t migrated our graphical builds out to the farm yet?

Yup. Windows. The Xvnc plugin, for instance, requires a minimal configuration: the path to Xvnc. Great, what do I tell Hudson for Xvnc, that will also work on Windows? Remember, we’re launching the Hudson slave as a Windows process, only using Cygwin to make the SSH connection. It didn’t work very well (read: not at all) to try launching the slave instance from within Cygwin, and it’s unclear that we’d actually be testing our builds in a Windows environment if we could do that. And this problem is for a plugin that Hudson provides; what if Xvnc doesn’t cut it, and we really need something more like Xvfb or even X itself? We may travel a ways down the road to fixing this if we can provide slave-specific environment variables, as I described above; but we can’t simply turn off a command selectively in one slave environment and not in others.

Hudson enforces a degree of uniformity among its slaves that is really at the heart of the problems we have including Windows in the mix. I don’t think Hudson is wrong to do this, at least to some extent…but it does inflict a lot of pain at times. At the moment, we’re still coming to terms with the depth of the problem when it comes to graphical builds. Unfortunately, that means all I have for the moment is a description of the problem as I understand it. I don’t know whether this is the full extent of the problem and, as a result, I can’t really claim to see a workable solution yet. If anyone has experience with this, again I’m very interested to hear from you!

Essential Equipment and Resources

Okay, now that I’m done whining (for now), I want to end this post with some software and information I’ve found indispensable during this process. Hopefully, these can help make your experience a little less rocky.

  1. OS X Equipment
    1. Chicken of the VNC
    2. Parallels (VMWare Infra Client GUI only runs on Windows)
    3. iTerm
    4. Microsoft RDC (Remote Desktop Client for OS X)
  2. Manuals, How-Tos, References
    1. OpenSolaris SysAdmin Guide (This isn’t the one I used, but it looks very similar…maybe just a different format from the PDF I found.)
    2. Windows W32Time How-To (NTP on Windows Vista)
    3. Cygwin+SSHd on Vista (bare-bones instructions, but they do work)
    4. Symlinking on Windows Vista – Mklink
    5. VMWare and Timekeeping Reference (PDF)
    6. FreeBSD on VMWare and other tips (Good information about running virtualized systems)