Does Your Tool Depend on the Central Maven Repository?


December 15, 2008 By Tim O'Brien

Here are three quick notes for people who write tools that depend on the Central Maven Repository. Adhering to these standards will help to preserve the free, public resource that millions of Maven users depend on. We think it is great that people are interacting with the repository, but we want to make sure that we’re all doing so in a responsible manner that conserves bandwidth.

#1: Populate User-Agent Headers

If your build tool or repository manager interacts with the Central Maven Repository, you need to start setting a reasonable User-Agent. Nexus, Artifactory, Archiva, Maven, and Ivy all identify themselves in the User-Agent header. This is incredibly important for the health of the Central repository: if a bug in a release of one of these tools saturates Central’s bandwidth, it can jeopardize the availability of this shared resource for everyone else. Appropriate User-Agent headers help the repository maintainers quickly identify problems so that we can make sure that Central remains available for the majority of users.

In general, if your tool ever interacts with the Central Maven repository it is a good idea to maintain contact with at least one member of the Apache Maven PMC, and there are a few working for Sonatype including Brian, Jason, and John.

If you decide to write a new build tool, that’s great. Before you distribute it to tens of thousands of developers, make sure your client sets the User-Agent. We’ve had cases recently that involved misconfigured tools that lacked an identifying User-Agent. If the tool in question had sent a meaningful User-Agent header, it would have taken all of five minutes to find the problem and identify the project responsible. Instead, it took members of the team multiple weeks of effort.
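As a rough illustration of what setting the header can look like, here is a minimal sketch using only the JDK's HttpURLConnection. The tool name, version, contact address, and the "name/version (contact; runtime)" layout are all hypothetical choices made for this example, not a format required by Central. The important part is simply that the header names your tool and gives the maintainers a way to reach you.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

class CentralRequest {

    // Builds an identifying User-Agent value. The layout is an assumption
    // for this sketch: "<tool>/<version> (<contact>; Java <runtime version>)".
    static String userAgent(String tool, String version, String contact) {
        return tool + "/" + version
                + " (" + contact + "; Java " + System.getProperty("java.version") + ")";
    }

    // Prepares (but does not yet send) a request that carries the header.
    // No network traffic happens until the caller connects or reads the response.
    static HttpURLConnection open(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would build the value once, for example CentralRequest.userAgent("ExampleBuildTool", "1.0", "dev@example.org"), and pass it to every request the tool makes against the repository. If you use a different HTTP client library, the same idea applies; only the call that sets the header changes.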

#2: Don’t Scrape Central, Don’t Walk the Repo

There are a few services that have decided to scrape the entire contents of the central repository into a separate copy and then operate on that copy. While there are different ways to do this (rsync, getting it from a mirror), we often see people using a tool like wget with a modified User-Agent header field to constantly scan the entire repository. This creates a storm of requests against Central and wastes bandwidth. It also crowds out other people trying to use the Central Repository.

Again, this is a case of communicating with the Maven PMC. If you are building a service that is going to consume gigabytes of bandwidth on Central, you need to get in touch with the PMC, as they are the body that oversees and supports the Central Maven repository. If you want to do this, you can, but people will likely point you at a mirror. Even with a mirror, you need to make sure that the operators of that particular mirror don’t mind you siphoning off a couple of gigabytes every month. Bandwidth isn’t free, and the first priority is always availability.

#3: Don’t use a 404 as a Search Tool

There are a few tools out there that haven’t figured out how to use the Nexus index from the Maven repository (Maven is one of them). Going forward, most tools should start consulting the Nexus index to test for the presence or absence of an artifact. 404 responses are not a problem in and of themselves, but we’re trying to encourage people to minimize the number of remote interactions with Central so we can maximize availability. We could likely serve tens of thousands of 404 responses a second, but we’d like to think that people want to minimize remote interaction when possible.
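The index-first idea can be sketched without any real index code. In the example below, a plain in-memory set of repository paths stands in for the Nexus index; it is not the Nexus indexer API, and the class and method names are made up for illustration. The tool asks its local copy of the index whether an artifact exists and only goes to Central when the answer is yes, so a missing artifact costs no remote request at all.

```java
import java.util.Set;

class IndexFirstLookup {

    // Stand-in for a locally held repository index (hypothetical; a real
    // tool would query the Nexus index here instead of a Set).
    private final Set<String> indexedPaths;

    // Counts how many lookups would actually have gone out to Central.
    private int remoteRequests = 0;

    IndexFirstLookup(Set<String> indexedPaths) {
        this.indexedPaths = indexedPaths;
    }

    // Standard Maven repository layout for a POM:
    // groupId with dots turned into slashes, then artifactId/version/file.
    static String pomPath(String groupId, String artifactId, String version) {
        return groupId.replace('.', '/') + "/" + artifactId + "/" + version
                + "/" + artifactId + "-" + version + ".pom";
    }

    // Returns true only if the index says the artifact exists, in which
    // case the caller would go on to download it from the repository.
    boolean shouldFetch(String groupId, String artifactId, String version) {
        if (!indexedPaths.contains(pomPath(groupId, artifactId, version))) {
            return false; // answered locally, no request and no 404
        }
        remoteRequests++;
        return true;
    }

    int remoteRequests() {
        return remoteRequests;
    }
}
```

With an index containing only junit/junit/3.8.1/junit-3.8.1.pom, a lookup for junit 3.8.1 is allowed through, while a lookup for a version that was never published is rejected locally and the remote request counter stays where it was.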

If you are writing a tool and you want to know how to interact with the Nexus index, the code to do so is licensed under the Eclipse Public License and is available from http://nexus.sonatype.com.