The Hudson Build Farm Experience, Volume II

I’ve been working on a Hudson-based build farm for Sonatype and Maven open source builds since sometime in September of 2008. I’ve learned a lot as a result, so I thought I’d share some experiences from the trenches. In this second installment I’ll discuss a few more details related to remote maintenance, along with the hurdles we encountered integrating Windows into our Hudson farm (and the solutions we found).

Eyes and Ears: Getting Access

Having access to the build farm is critical for maintenance, but it can also be very important to developers who are debugging a failing build. In our build farm, we’re using various mechanisms to provide this access, largely based on what is best suited for a particular VM operating system. The basic requirement here is to provide “natural” browsing capabilities for the filesystem on each VM, along with the ability to upload files if necessary (this came in very handy for installing and testing FlashPlayer, for instance).

SSH

For starters, we have SSH access to all machines in the farm. This is partially born of convenience, since we use SSH to connect Hudson nodes together, and partially of practicality, since it’s one of the best connection methods for headless Linux systems (those running without X Windows). Initially, we had an SSH port mapped through the router VM from the public connection into each Hudson VM. However, we soon realized that direct access to the Hudson slave VMs bought us very little, since it was next to impossible to remember which non-standard port led to which VM. Now, we’ve simplified SSH access down to two public ports: one for accessing the webserver VM, and one for accessing the Hudson master instance. From the Hudson master instance, it’s a breeze to connect to any of the Hudson slave VMs, using their internal DNS hostnames to avoid confusion. While this may not be quite as efficient, it greatly reduces how much someone needs to know about our build farm in order to make use of it. It’s simply not realistic to force someone to keep a wiki page open so they can look up the port number to use to reach our Windows Hudson slave, and the per-VM ports add two layers of maintenance burden (maintaining the wiki page, and maintaining the NAT mappings). Simple trumps efficient.
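As a rough sketch of the two-hop route described above, the helper below wraps the hop through the master in one function. The master's host name and port are placeholders, the `farm_ssh` name is made up for illustration, and the `-J` (ProxyJump) option needs OpenSSH 7.3 or later, so this is not how the farm was actually wired up at the time; older clients would need a `ProxyCommand` setup instead.

```shell
# Assumed public address of the Hudson master (user@host:port); the host
# name and port here are placeholders, not the farm's real values.
FARM_MASTER=${FARM_MASTER:-hudson@master.example.org:2222}

# Open a shell (or run a command) on an internal VM by hopping through
# the master. $1 is the VM's internal DNS name; the rest is the command.
farm_ssh() {
    vm=$1
    shift
    ssh -J "$FARM_MASTER" "hudson@$vm" "$@"
}

# Usage:
#   farm_ssh jeos1            # interactive shell on the jeos1 slave
#   farm_ssh jeos1 uptime     # run a single command there
```

The only thing a user has to remember is the internal hostname of the VM; the public port lives in one place.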

Terminal Services

While SSH is a good, all-purpose access mechanism, it’s not exactly ideal for navigating the Windows filesystem. And you can forget about managing (read: killing) Windows processes from the command line; it’s definitely not a user-friendly experience. To compensate, we use Windows Terminal Services to get access to the desktop of the VM. There are some pitfalls associated with Terminal Services on Vista, but I’ll get into that in a minute. Once set up correctly, Terminal Services (or TSC, or RDC, or rdesktop, or whatever you prefer to call it) provides a nice, snappy way to navigate a remote Windows system from just about any client operating system. The proliferation of names for this protocol provides a clue as to just how many different client applications there are. Beware, though: they work with varying degrees of success. I’ve found Microsoft’s own RDC application to work best for OS X.

VNC

As if two connection protocols weren’t enough, our addition of the OS X VM to the build farm forced us to use yet another: VNC. OS X is actually pretty well-behaved when it comes to ease of navigation over an SSH connection, and managing processes on the machine via the command line is really no problem at all. However, there is at least one function that all OS X machines perform which absolutely requires access to the desktop: software updates. Without graphical access to the Mac, we can’t install updates to the operating system since Software Update is a graphical application. So, while we don’t use VNC very often at all, we do have to maintain access to the Mac desktop so that the relatively rare software update can run. Incidentally, VNC has also proven quite handy for installing new software on the Mac.

Finally, there is one maintenance task that even the best remote connection has a hard time coping with. As anyone who manages remote machines for a living will tell you, there are times when there simply is no substitute for an on-site thumb to push the power button. In the early days of this build farm, we experienced several instances where our Ubuntu Hudson slave simply maxed out its available RAM. I’m not sure whether dedicated hardware would react in precisely the same way, but when we exceeded the allocated RAM for that virtual machine, it simply froze. Solid. The only way to bring it back was to power cycle the virtual machine (even this didn’t always work…I’ve completely rebuilt the Ubuntu VM a couple of times now). This is where our last line of defense comes in: VMWare Infrastructure Client. Infrastructure Client gives you a bird’s eye view into the running ESXi machine, where you can provision, decommission, modify, and manipulate all of the running VMs. You can even grab a monitor-ish view on an individual VM in order to execute commands on the running OS. When nothing else works, Infrastructure Client still does. This handy application has saved my bacon on multiple occasions, from misconfigured network interfaces to the aforementioned RAM-locked coma. But beware: this level of access comes at a price. VMWare Infrastructure Client only runs on Windows, a fact that for me required installation of a Windows XP virtual machine via Parallels on my Mac. Fortunately, this solution works well, and I don’t have to keep a dedicated Windows machine in the closet.

The Square Peg: Dealing with Windows

On the whole, Windows has not only been the hardest to integrate with our Linux-based Hudson master instance, but also by far the hardest to connect to in a consistent, reliable way. I suppose some of this is to be expected, given the difference in filesystem structure between Windows and the rest of the world. But what may be a little less expected - it was for me, at least - is just how hard it is to integrate Windows into an overall plan for remote access. I’m going to address the remote access question first, since it at least has a reasonable solution; but rest assured, I’ll get back to the filesystem challenges soon.

Windows Connectivity

As I mentioned above, SSH has become our lowest-common-denominator connection for remote access. SSH works well (in almost all cases), is easy to configure, and performs a double duty by allowing both remote shell access to a machine and the ability to remotely execute a program, as with the following example:

<code># Quote the remote command so that $PATH is expanded on jeos1, not locally
ssh jeos1 'bash -l -c "echo \$PATH"'
</code>
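Quoting deserves a moment when running commands this way, because ssh joins its command arguments with spaces and hands the resulting string to the remote shell, which parses it a second time. The sketch below imitates that locally with a hypothetical `fake_ssh` helper (not a real tool; it simply runs `sh -c` as a stand-in for the remote shell) so the effect can be tried without a build farm.

```shell
# Hypothetical stand-in for "ssh somehost ...": join the arguments with
# spaces and let a second shell parse the result, roughly as sshd does.
fake_ssh() {
    sh -c "$*"
}

# Unquoted: the second shell sees `sh -c echo hello`, so -c runs a bare
# `echo` and "hello" merely becomes $0. Only an empty line is printed.
#   fake_ssh sh -c echo hello
#
# Quoted: the inner command survives the second parse and prints "hello".
#   fake_ssh 'sh -c "echo hello"'
```

The same second parse is why a variable such as `$PATH` has to be protected from the local shell when the remote machine's value is the one you want.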

SSH also has the advantage of ubiquity; you can install it everywhere, on basically any operating system. Or so I thought. It turns out that SSH for Windows comes in basically two flavors: a Windows-native “port” that uses Windows’ own authentication methods and contains some pretty interesting divergences from *nix or BSD brands of SSH, and SSH over Cygwin. In a previous life, I had already tried using the native SSH daemon, with little or no success. Since many of us at Sonatype have experience working with Cygwin, we opted for that solution. Once set up, this option works fairly well…but the setup is not for the faint of heart. After much Googling, I came across this explanation for installing SSHd on Cygwin+Vista. While it doesn’t provide much explanation along the way, the steps do work for gaining basic SSH access to a Vista machine. Later, I found out the hard way that providing desktop access to applications executed through that SSH session was another matter entirely, one that required quite a bit of trial and error to solve. In the end, we configured the SSHd to run as the same user (‘hudson’) as the Hudson slave.jar, to ensure Hudson builds that use FlashPlayer to run tests had access to the desktop. In point of fact, I’m still not sure we have that one completely figured out…

Once we had a reliable SSH connection to our Vista VM, we learned that failed Hudson jobs could pollute the running system with orphaned processes: Java or FlashPlayer instances that would never complete for whatever reason, but instead would squat on their reserved sockets and file locks until forcibly removed. In the end, we simply could not avoid enabling desktop access via Terminal Services. However, we found out that merely turning on this service is not enough; Vista uses a newer protocol version than just about any client out there (except, I imagine, the Vista terminal services client). Got XP? Maybe OS X? Tough luck.
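For the times when the desktop is out of reach, a rough command-line fallback for spotting orphans over the Cygwin SSH session might look like the sketch below. It relies on the standard Windows `tasklist` and `taskkill` tools; the `pids_for_image` helper is a made-up name, and this is not the procedure we actually used on the farm.

```shell
# Print the PID of every process whose image name matches $1.
# Expects `tasklist /FO CSV /NH` output on stdin, one process per line:
#   "image name","pid","session name","session#","mem usage"
pids_for_image() {
    awk -F'","' -v image="$1" '
        { name = $1; sub(/^"/, "", name) }          # drop the leading quote
        tolower(name) == tolower(image) { print $2 } # image names are case-insensitive
    '
}

# Usage on the Windows slave (from a Cygwin shell):
#   tasklist /FO CSV /NH | pids_for_image java.exe
#   taskkill /F /T /PID <pid>     # force-kill one PID and its children
```

Listing first and killing by PID is deliberate: the Hudson slave itself runs as a java.exe process, so killing by image name would take the slave down along with the orphans.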

Once again, after much Googling, we learned that Terminal Services on Vista uses a new protocol feature called Network Authentication by default. This feature excludes pretty much the rest of the free world from connecting to your Windows machine, even if you ask nicely. Luckily, I dug up this page that gave some tips for working around Network Authentication. For specifics, see the section entitled “Enable Vista Remote Desktop host computer use of Network Level Authentication”, about 3/4 of the way down that page. With Network Authentication disabled, XP and OS X clients were free to connect to our Vista VM, allowing us access to install and test applications such as FlashPlayer, and to manage (KILL!) orphaned processes.

Integrating Hudson: Linux Master, Windows Slave

As many developers know, writing software that’s meant to run on both Linux and Windows can be particularly difficult. It’s not simply that Linux uses ‘/’ for file- and directory-separation, while Windows uses ‘\’. It’s not just that Linux uses ‘/’ to denote the root of the filesystem for the whole computer so that URLs often look like: file:///home/hudson/.m2/settings.xml, while Windows uses drive letters and has multiple filesystem roots resulting in URLs that look like: file:/C:/Users/hudson/.m2/settings.xml. No, the pain of programming for both Windows and non-Windows target environments is all of this, and much more. From the aforementioned path-formatting differences, to different line endings, to differing approaches to child processes and file locking, working in a mixed environment like this can often feel like death by a thousand cuts. It’s not that any one of these inconsistencies is insurmountable; it’s that, taken together, coping with all the differences can lead to very, very complex configurations.
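To make the path and URL difference concrete, here is a small sketch that turns an absolute path of either flavor into the URL shapes shown above. The `to_file_url` name is invented for this example, and the function deliberately ignores percent-encoding of spaces and other special characters.

```shell
# Convert an absolute Unix or Windows path into a file: URL.
to_file_url() {
    case "$1" in
        [A-Za-z]:*)
            # Windows: flip backslashes to slashes and keep the drive letter.
            printf 'file:/%s\n' "$(printf '%s' "$1" | tr '\\' '/')"
            ;;
        /*)
            # Unix: the path already starts at the single filesystem root.
            printf 'file://%s\n' "$1"
            ;;
        *)
            echo "to_file_url: expected an absolute path: $1" >&2
            return 1
            ;;
    esac
}

# Usage:
#   to_file_url /home/hudson/.m2/settings.xml
#   to_file_url 'C:\Users\hudson\.m2\settings.xml'
```

Even this toy version needs a separate branch per platform, which is the small-scale version of the problem: every tool that touches a path ends up carrying a branch like this somewhere.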

As a case in point, adding Windows to our predominantly Linux-compatible Hudson build farm involved:

installing Cygwin to run SSHd
writing a separate slave-launching script (start.bat, mentioned in the last post)
liberal use of the Windows symlink approximation (abomination?) called mklink, to approximate a filesystem layout normalized with that of our other Hudson VMs
a lot of hand-holding to clear orphaned processes resulting from failed builds
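As an illustration of the mklink item in the list above, a directory link can be created from a Cygwin shell roughly as follows. `mklink` is a cmd.exe built-in, hence the `cmd /c` wrapper; the `link_dir` name and both paths in the usage comment are hypothetical and do not reflect our actual layout. On Vista, creating the link generally requires an elevated (administrator) session.

```shell
# Create a Windows directory symlink: $1 is the link to create, $2 the
# existing directory it should point at. Both are Windows-style paths.
link_dir() {
    cmd /c mklink /D "$1" "$2"
}

# Usage (hypothetical paths; single quotes keep the backslashes literal):
#   link_dir 'C:\opt\maven' 'D:\tools\maven'
```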

In spite of the fact that we brought Windows into the build farm mix early on - we’ve been running builds on Vista since probably around the beginning of October at least - the Windows slave VM still contains more chewing gum, duct tape, and shade-tree engineering than the rest of the build farm put together. The fact that it works for most of our builds is something I attribute more to luck than skill, and it’s still impossible to run the Maven bootstrap on anything but the default JDK version set up in the start.bat file. Forget specifying the Java version in that job definition. The Maven bootstrap uses Ant to orchestrate a rough first pass, then spawns a Maven process toward the end of the build to run unit tests and other verifications. Any explicit declaration of a Java version in this job results in the Windows slave using an incompatible JAVA_HOME path, which causes the build to fail abysmally.

I’ve got a lead on fixing this problem in the form of a patch to Hudson, but for now this issue is filed solidly in the ‘In-Progress’ category.

Path problems aside, we ran into some interesting issues with the Maven versions we installed on Windows. I can’t say I’m entirely sure how this happened; it seems like anything installed should have the equivalent of 755 and 644 directory and file permissions, respectively. In any case, when I installed Maven on the Vista VM, the hudson user didn’t have permissions to actually execute it. As a result, I modified the permissions of the entire /opt/maven equivalent path - the parent directory for all Maven installations - to allow Everyone access to All Permissions. Sure, that’s probably too lax; but this is a slave VM that’s basically cut off from the world, and in the end, it’s pretty much disposable. If it gets compromised, the perpetrator will only have access to OSS source code, and we can decommission/re-provision a new Windows VM quickly thanks to our SVN repository.

Summary

Now that we’ve covered the ins and outs of connecting to the build farm and dealing with the “special” nature of Windows, I think this is a good place to end.

In the last post, I covered the basic topology and system configuration for our build farm. In this post, we’ve talked at length about remote connectivity and Windows-related issues. Please keep an eye out for my next post, when I’ll wrap up this series by discussing some special considerations dealing with VMWare’s ESXi environment, and looking forward to some of the as-yet unresolved issues we’re facing today.