Maven Central Failover Mechanism Improves: Temporary IP change on Monday


May 9, 2011 By Brian Fox

Spoiler Alert! This post contains information about a change to Maven Central’s IP addresses. If your network has firewall rules in place that need specific IPs, be sure to read this post.

We’re investing continued effort into making sure that Central is as available as possible. Because Maven Central supports a world of developers, even a few minutes of downtime is unacceptable to us. In line with our previous efforts to make Maven Central as bullet-proof as possible, we are planning to make the US repository even more fault tolerant using a tool called Pacemaker. Once we’ve had time to evaluate the impact of the changes described in this post, we will deploy similar measures for our European Union and Asia/Pacific mirrors.

This is a follow-up to our previous enhancements to Central, and it focuses on the Maven Central US repository.

The US repository runs on two virtual machines (VMs) in a VMware cluster with four physical nodes configured to use VMware’s High Availability support. Despite these multiple levels of fault tolerance, recovery from a misconfiguration or other catastrophic failure still requires a DNS update to a standby IP to restore Maven Central. This DNS change takes time: time to make the change, and then an often unpredictable amount of time for the change to propagate across the Internet.

This is unacceptable to us. Millions of developers depend on Maven Central, and we’ve invested in redundant virtual machines running on redundant physical hardware. If there is an unforeseen event, the problem should be addressed in a few seconds.

To achieve immediate failover in the event of a failure, we will be using a tool called Pacemaker to manage Maven Central’s floating IP cluster. Pacemaker monitors the repository IP address, the Nginx process status, and sample content from Maven Central. If Pacemaker identifies a failure in any one of these components, it will immediately fail over to the backup machine. In my testing, this takes about 3 to 5 seconds.
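The "any one failure triggers failover" rule can be sketched in a few lines of Python. This is only an illustration of the decision logic, not Pacemaker's implementation or configuration; the `HealthCheck` type, the function names, and the stand-in probes are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class HealthCheck:
    """Hypothetical wrapper around one monitored component."""
    name: str
    probe: Callable[[], bool]  # returns True when the component is healthy


def failed_checks(checks: Iterable[HealthCheck]) -> List[str]:
    """Return the names of all checks that fail.

    A probe that raises an exception is treated as a failure.
    """
    failed = []
    for check in checks:
        try:
            healthy = check.probe()
        except Exception:
            healthy = False
        if not healthy:
            failed.append(check.name)
    return failed


def should_fail_over(checks: Iterable[HealthCheck]) -> bool:
    """Fail over as soon as any single component is unhealthy."""
    return len(failed_checks(checks)) > 0


if __name__ == "__main__":
    # Stand-in probes; real ones would test the IP, Nginx, and sample content.
    checks = [
        HealthCheck("repository IP", lambda: True),
        HealthCheck("nginx process", lambda: True),
        HealthCheck("sample content", lambda: False),
    ]
    print(failed_checks(checks), should_fail_over(checks))
```

The point of the sketch is that the three monitors are combined with a logical OR on failure: the standby takes over even when only one of them reports a problem.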

In a previous post I discussed the systems we have in place and how the IPs are configured:

We are aware that some users have firewall rules that are locked to the external service IP. Because of this, we strive to maintain a consistent IP for each system. However, the primary mechanism for accessing the repository is DNS for most users, and at times our failover escalation or maintenance procedures may require us to redirect the DNS for one system to another. For this reason, if you have firewall rules in place that need specific IPs, please allow all of the addresses in this list so that you won’t be affected by any temporary transitions:

  • 207.223.240.88: US primary
  • 207.223.240.92: US staging / standby
  • 89.167.251.252: UK primary
  • 89.167.251.253: UK standby
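If your firewall rules are kept as a list of addresses or CIDR ranges, a short script can confirm that all four addresses are covered. This is a minimal sketch using Python's standard `ipaddress` module; the `CENTRAL_IPS` table copies the addresses listed above, and the function name and the format of the `allowed` argument are assumptions for illustration.

```python
import ipaddress
from typing import Iterable, List

# Addresses copied from the list above.
CENTRAL_IPS = {
    "207.223.240.88": "US primary",
    "207.223.240.92": "US staging / standby",
    "89.167.251.252": "UK primary",
    "89.167.251.253": "UK standby",
}


def missing_from_allow_list(allowed: Iterable[str]) -> List[str]:
    """Return the Central IPs that no entry in `allowed` covers.

    `allowed` holds single addresses ("207.223.240.88") or CIDR ranges
    ("207.223.240.0/24"), as they might appear in firewall rules.
    """
    networks = [ipaddress.ip_network(entry, strict=False) for entry in allowed]
    missing = []
    for ip in CENTRAL_IPS:
        address = ipaddress.ip_address(ip)
        if not any(address in network for network in networks):
            missing.append(ip)
    return missing


if __name__ == "__main__":
    # Example rule set that forgets the US standby address.
    rules = ["207.223.240.88", "89.167.251.252/31"]
    for ip in missing_from_allow_list(rules):
        print(f"not allowed: {ip} ({CENTRAL_IPS[ip]})")
```

A rule set that only allows the US primary would pass day-to-day traffic but would block you during a transition to the standby, which is exactly the case this check is meant to catch.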

Since we declared .88 to be the primary IP in the US, I intend to use that as the new clustered IP. The migration plan is as follows:

  1. At approximately 9 PM CST on Monday, May 9th, we will update the DNS entries for repo1/repo2.maven.org to point at 207.223.240.92 (the standby server).
  2. On Tuesday morning, the 207.223.240.88 IP will be added to the cluster. We’ll test the IP and failover using this address.
  3. Once the testing is successful, we will revert the DNS entries back to 207.223.240.88.

Both DNS shifts should be completely transparent: both servers will be capable of serving artifacts, so regardless of which IP resolves for you during the transition, the repo will respond. The TTL on the DNS records is currently 5 minutes, so normally everyone would follow the change within that time, but we are going to allow 12 hours (just in case).
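If you want to see which address you are getting during the transition, a check along these lines may help. It is a rough sketch: the helper names are made up for illustration, the known addresses are the two US IPs from this post, and `socket.gethostbyname_ex` only reports what your local resolver currently returns.

```python
import socket
from typing import Callable, Iterable, List

# The two US addresses named in this post.
KNOWN_US_IPS = {"207.223.240.88", "207.223.240.92"}


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the IPv4 addresses the local resolver gives for hostname."""
    # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist).
    return socket.gethostbyname_ex(hostname)[2]


def unexpected_addresses(
    hostname: str,
    resolver: Callable[[str], List[str]] = resolve_ipv4,
    known: Iterable[str] = KNOWN_US_IPS,
) -> List[str]:
    """Return resolved addresses that are not in the known set, sorted."""
    return sorted(set(resolver(hostname)) - set(known))


if __name__ == "__main__":
    for host in ("repo1.maven.org", "repo2.maven.org"):
        print(host, resolve_ipv4(host), unexpected_addresses(host))
```

An empty result from `unexpected_addresses` means the name resolves to either the primary or the standby, both of which serve artifacts during the migration. The resolver is passed in as a parameter so the comparison can be exercised without touching the network.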

We will be rolling out this same setup to our UK servers in a few weeks.

I can’t thank Contegix enough for their ongoing support. They conceived of and implemented this virtual IP solution. They are playing a big role in making Maven Central more reliable.