The Hudson Build Farm Experience, Volume I

I’ve been working on a Hudson-based build farm for Sonatype and Maven open source builds since sometime in September of 2008. I’ve learned a lot as a result, so I thought I’d share some experiences from the trenches. In this first installment, I’m only going to cover our goals and outline the basic setup of our farm; I’ll save discussion of specific hurdles and advantages offered by our environment for the next post.

The Challenge

Java software must function in a nearly endless variety of runtime environments. While the bytecode itself is basically portable from one operating system to another, I’m sure everyone knows this doesn’t mean software written in Java is automatically portable. Write-Once-Run-Anywhere (WORA) is only an ideal; in real life, software must be tested on all platforms. In the past, Maven releases have relied on a best-effort approach, where the continuous integration builds and integration tests were run on one operating system, and other operating systems were periodically "spot checked" just before a release. We were using JIRA and our development community to compensate for the lack of a real build farm, which would have allowed us to continually check for problems on a variety of platforms. Since we were running our CI operations on a Linux, BSD, or Solaris machine (it varied), we relied on developers to file JIRAs for anything that turned up broken on Windows or OS X. Since most of us work on one of these two platforms, the most critical issues were normally caught and fixed. If an issue cropped up on an operating system that was covered by neither the CI system nor the developers’ own workstations, it typically survived until the next release cycle, when a user reported it and worked with the development team to test and get it resolved.

When we started releasing open source here at Sonatype, we decided to take a much more proactive role in verifying our software. Our approach has been fairly straightforward: make sure we encounter and fix as many of the issues in our software as possible, before they have a chance to trip up our users. Like any other aspect of the software engineering world, our ideal has been tempered by a dose of practical reality…but I’ll get to that later. For now, suffice it to say that we wanted the ability to test software on as many operating systems as could run Java, and as many Java implementations as we are willing to say we support commercially. Additionally, the results of all these myriad builds should be collated and easy to understand.

Since our business is very much dependent on the health of Maven, we decided this new build farm should be provided as a resource for the Maven community in addition to our own open-source offerings.

Enter Hudson

We settled on Hudson as our continuous integration system for a few reasons. First, it’s dead simple to install and use in the non-distributed sense, and many of us had glowing opinions of this little application. Even now, my sentiments toward Hudson are similar to those I’d have toward a long-time friend and colleague: I’m still impressed despite the fact that I now have enough experience to see its flaws.

The second and third reasons for choosing Hudson were even more practical. It was the only system with a history of supporting multiple versions of Maven and multiple JDKs. Also, at the time, it was the only system that could collate distributed builds from multiple slave nodes running different operating systems. While this latter feature was new - and we really didn’t appreciate just how new at the time - it was working and documented.

Finally, Hudson offers a plugin API and a large number of plugins to help cope with extra requirements like IRC notification or Git support. These plugins were a big attraction, but the fact that Hudson’s developers were thoughtfully exposing a plugin API meant that we could probably provide any extra bells and whistles we might require.

In fact, at the point where we decided to implement a build farm, we already had a one-dimensional, non-distributed Hudson deployment. So I guess you could even add that to the list of advantages: we had a certain amount of experience maintaining a Hudson instance. What remained was learning how to set up and maintain the underlying array of operating systems on which the build farm would rely, then learning how to run Hudson on this array.

Nuts and Bolts: Our Farm Environment

Since we really weren’t sure what operating-environment details we might require for adequate testing in our build farm, we opted to run the whole thing - or, as much of it as possible - on a large VMware ESXi machine. This gives us the ability to provision operating systems as needed, or decommission old VMs when they outlive their usefulness. It also gives us a certain degree of scalability, since (theoretically) we can deploy copies of a given operating system to adjust to demand. In practice, this scalability is limited by the resources available on the machine as a whole, but more on that later! In any case, alternatives like Xen would have limited the range of operating systems we could have deployed. Alternatives like separate hardware per node would have left us guessing up front what our real operating system needs would be, and what hardware we would need to support them. VMware ESXi seems uniquely suited for this sort of system; its flexibility has proven to be a great asset as we planned and then updated our build farm.

Hardware

Our hardware consists of two quad-core 3.16 GHz Xeon CPUs, with 32GB of RAM, and a 1.3TB disk array. On this machine, we run a router VM and a bare-bones httpd VM that uses mod_proxy to connect to our Hudson master VM, which is an Ubuntu JeOS instance. In addition, we have Hudson slave VMs for Windows Vista 64-bit, Ubuntu JeOS (to prevent overloading of the master instance), FreeBSD, Solaris, and CentOS. Finally, we have a Mac OS X machine colocated with the ESXi machine and connected up as if it were just another virtual machine.

VMs and Configuration

The router VM provides NAT/firewall capabilities for the entire farm, as well as DHCP for new VM setups, and internal DNS. The httpd VM literally runs nothing above the operating system level except for SSHd, Apache, and logrotate (to manage the disk-clogging tendencies of webserver logging). For obvious reasons, the Hudson master and slave VMs are far more involved, mostly due to Hudson, SSH, Subversion, and Maven configurations, plus the Ant, Maven, Java, and Hudson files themselves. All of these Hudson VMs require basically the same software, and they share every file that can be used across platforms. To keep all these configurations and software installations up to date across an array of systems, we check them into a dedicated Subversion repository, then simply check them out on each machine as a series of working directories. Got an update for the Maven settings you need to use for builds on the farm? No problem. Just update the VM working directories. To make this even easier, we’ve created a couple of Hudson jobs that will actually call svn update for each working directory on each Hudson VM. In cases where a piece of software is operating system-dependent (hint: the JDK), we simply create a directory in Subversion for each operating system class, with mirrored directory structures within. Want to add a new Linux slave? Check out the JDKs from /grid/linux/opt/java. New Solaris slave? You’ll want /grid/solaris/opt/java. Everything we need to provision a new Hudson VM is contained within Subversion.
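
To make the per-operating-system layout concrete, here is a minimal sketch of how a provisioning script might pick the right JDK tree. Only the linux and solaris paths come from our actual repository layout as described above; the os_class and jdk_repo_path helpers, the freebsd directory name, and the GRID_REPO_URL variable are hypothetical names used purely for illustration.

```shell
#!/bin/bash

# Hypothetical helper: map the kernel name printed by `uname -s`
# to the name of an operating system class directory.
os_class() {
  case "$1" in
    Linux)   echo linux ;;
    SunOS)   echo solaris ;;
    FreeBSD) echo freebsd ;;   # assumed directory name, not taken from the post
    *)       echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: build the repository path of the JDK tree for a kernel name.
jdk_repo_path() {
  local class
  class=$(os_class "$1") || return 1
  echo "/grid/${class}/opt/java"
}

# Possible use on a new slave (GRID_REPO_URL is a placeholder for the repository root):
#   svn checkout "${GRID_REPO_URL}$(jdk_repo_path "$(uname -s)")" /opt/java
```

Because every operating system class mirrors the same directory structure, the checkout target (/opt/java here) would stay the same no matter which class is selected.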

The normalization of our VM directory structures is a huge advantage when you’re supporting a distributed Hudson environment; so much so that it’s highly recommended in Hudson’s own documentation for setting up distributed builds. Since all path information is managed by the master Hudson instance and passed on to the slave instances, it’s absolutely critical that the directory structures and installed software look as uniform as possible. This has obvious consequences for adding Windows to the mix, which is actually still one of our biggest pain points…but I’ll discuss this at length in the next installment.
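
One way to keep an eye on that uniformity is a small check that reports any expected path missing from a machine. This is only a sketch, not something taken from our farm: check_layout is a hypothetical helper, and the paths in the usage comment are simply the ones our slave start script refers to.

```shell
#!/bin/bash

# Hypothetical helper: print every expected path that is missing under a root
# directory, and return non-zero if at least one is missing.
# Usage: check_layout ROOT PATH...
check_layout() {
  local root="$1"; shift
  local missing=0 path
  for path in "$@"; do
    if [ ! -e "${root}${path}" ]; then
      echo "missing: ${path}"
      missing=1
    fi
  done
  return "$missing"
}

# Possible use on a slave, with an empty root so the paths are checked as-is:
#   check_layout "" /opt/java/sdk/current /opt/ant/apache-ant-1.7.1 \
#       /opt/maven/apache-maven-2.1.0-M1 /opt/hudson/slave/slave.jar
```

The root argument exists only so the same check can be pointed at a scratch directory or a mounted disk image instead of the live filesystem.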

Running Hudson

As for how to actually run Hudson, we’re using JavaServiceWrapper for the master instance. It gives us a nice, familiar way to configure and control the system, all packaged in a script that’s compatible with System V initialization. The slaves are actually controlled through the master instance as well, using SSH public-key authentication and a convoluted launch command. Each new slave VM gets a standardized $HOME/.ssh directory that allows the Hudson master to use its SSH key to log in to the slave machines without a password. The only thing that remains is to add the new slave’s DNS hostname to the $HOME/.ssh/known_hosts file on the Hudson master, to keep the SSH client from prompting when Hudson starts the slave connection. Once this is done, we simply configure the Hudson master to connect to the new slave using a command line like the following:

<code>ssh jeos1.grid.sonatype.com bash -l -c /opt/hudson/slave/start.sh
</code>
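
For that command to run unattended, the slave’s host key has to be in known_hosts on the master first. A minimal sketch of that step, assuming OpenSSH’s ssh-keyscan is available and that known_hosts entries are not hashed, might look like this. The add_known_host function and the KEYSCAN override are hypothetical, and blindly trusting a scanned key is only reasonable on a private network like the farm’s.

```shell
#!/bin/bash

# Hypothetical helper: append a host's public keys to a known_hosts file
# unless the host is already listed there (plain, unhashed entries only).
# Usage: add_known_host HOSTNAME [KNOWN_HOSTS_FILE]
add_known_host() {
  local host="$1"
  local file="${2:-$HOME/.ssh/known_hosts}"

  # The first field of each entry is a comma-separated list of host names.
  if [ -f "$file" ] && awk -v h="$host" '
       { n = split($1, names, ","); for (i = 1; i <= n; i++) if (names[i] == h) found = 1 }
       END { exit !found }' "$file"; then
    return 0   # already present, nothing to do
  fi

  # KEYSCAN may be overridden for dry runs; the default is ssh-keyscan.
  ${KEYSCAN:-ssh-keyscan} "$host" >> "$file"
}

# Possible use on the Hudson master:
#   add_known_host jeos1.grid.sonatype.com
```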

The start.sh script itself just does some basic environmental setup to account for differences in the SSH server behavior on different operating systems, then launches the Hudson slave.jar. Our particular script looks a little crufty, scarred from our path up the learning curve:

<code>#!/bin/bash

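# Tool locations and JVM settings used by builds on this slave.
export JAVA_BASE=/opt/java/sdk
export JAVA14=$JAVA_BASE/1.4
export HOME=/home/hudson
export MAVEN_OPTS="-Xmx512M -Duser.home=${HOME}"
export M2_HOME=/opt/maven/apache-maven-2.1.0-M1
export ANT_HOME=/opt/ant/apache-ant-1.7.1
export JAVA_HOME=/opt/java/sdk/current
export PATH=$M2_HOME/bin:$ANT_HOME/bin:$JAVA_HOME/bin:$PATH

# Pick up extra tools from /opt/local/bin on machines that have it.
if [ -d /opt/local/bin ]; then
  export PATH=/opt/local/bin:$PATH
fi

# Refresh the slave files (including this script) before launching.
svn up /opt/hudson/slave

# Optional per-machine overrides, refreshed from Subversion and then sourced.
if [ -f "$HOME/.hudson-config" ]; then
  svn up "$HOME/.hudson-config"
  source "$HOME/.hudson-config"
fi

cd "$HOME"
# Run the slave agent at the lowest scheduling priority.
nice -n 19 java \
    -Djava.util.logging.config.file=/opt/hudson/slave/logging.properties \
    -Duser.home=${HOME} \
    -jar /opt/hudson/slave/slave.jar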
</code>

On our Windows VM, we wound up using a DOS batch file that is much the same as the above: start.bat. We found it much simpler to spawn a batch file in Windows, despite the fact that we’re actually connecting through an SSH daemon that’s running on Cygwin. Similarly, on Solaris we have to tweak our HOME environment variable to use /export/home/hudson, which warrants yet another start.sh variant: start-solaris.sh.
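
We keep separate script variants, but the Solaris difference could in principle be folded into a single start.sh by choosing HOME from the kernel name. The sketch below shows that alternative; it is not what we actually run, and hudson_home_for is a hypothetical helper that falls back to the /home/hudson value from the script above for everything except Solaris.

```shell
#!/bin/bash

# Hypothetical helper: choose the hudson user's home directory from the
# kernel name printed by `uname -s` (Solaris reports "SunOS").
hudson_home_for() {
  case "$1" in
    SunOS) echo /export/home/hudson ;;
    *)     echo /home/hudson ;;
  esac
}

# In start.sh this would take the place of the hard-coded HOME line:
#   export HOME=$(hudson_home_for "$(uname -s)")
```

The trade-off is one script with a branch in it versus two scripts that can drift apart; with only one differing variable, either approach is easy to maintain.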

Summary

Now that I’ve gone through the basic topology and configuration for our Hudson environment, I think I’ll end the post here. In the next post, I’ll discuss some of the harder lessons we’ve learned from actually running Hudson in this configuration, particularly when it comes to including operating systems like Vista. Be sure to tune in!