Tag Archives: Failover

Maven Central Failover Mechanism Improves: Temporary IP change on Monday


May 9, 2011 By Brian Fox

Spoiler Alert! This post contains information about a change to Maven Central’s IP addresses. If your network has firewall rules in place that need specific IPs, be sure to read this post.

We’re working hard and investing continued effort into making sure that Central is as available as possible. As Maven Central supports a world of developers, even a few minutes of downtime is completely unacceptable to us. In line with our previous efforts to make Maven Central as bullet-proof and available as possible, we are planning to make the US repository even more fault tolerant using a tool called Pacemaker. Once we’ve had time to evaluate the impact of the changes described in this post we will deploy similar measure for our European Union and Asia/Pacific mirrors.

As a follow up to our previous enhancements to central, we are planning to make the Maven Central US repository even more fault tolerant.

The US repository runs on two virtual machines (VMs) in a VMWare cluster with 4 physical nodes configured to use the High Availability support in VMWare. Despite having multiple levels of fault-tolerance, recovery from a misconfiguration or other catastrophic failure still requires a DNS update to a standby IP to restore Maven Central. This DNS change requires time: time to make the change and then an often unpredictable time for DNS changes to propagate over the entire Internet.

This is unacceptable to us. Millions of developers depend on Maven Central, we’ve invested in redundant virtual machines running on redundant physical hardware. If there is an unforeseen event, the problem should be addressed in a few seconds.

To achieve immediate failover in the event of failure we will be using a tool called Pacemaker to manage Maven Central’s floating IP cluster. Pacemaker monitors the repository IP address, Nginx process status, and sample content from Maven Central. If Pacemaker identifies a failure in any one of these components it will immediately failover to the backup machine. In my testing, this takes about 3-5 seconds to occur.

In a previous post I discussed the systems we have in place and how the IPs are configured:

We are aware that some users have firewall rules that are locked to the external service IP. Because of this, we strive to maintain a consistent IP for each system, however the primary mechanism for accessing the repository is by DNS for most users. At times, our failover escalation or maintenance procedures may require us to redirect the DNS for one system to another. For this reason, if you have firewall rules in place that need specific IPs, please allow this list so that you won’t be affected by any temporary transitions:

  • 207.223.240.88 : US primary
  • 207.223.240.92 : US staging / standby
  • 89.167.251.252: UK Primary
  • 89.167.251.253: UK standby

Continue reading