Maven Indexer: Sonatype's Donation to Repository Search

We create a search index for the Maven repository so that you don't have to. What does this mean for you? It means that you don't have to run a "little Google" in your datacenter just to search for the latest log4j library, and you also don't have to sacrifice terabytes of bandwidth downloading thousands of artifacts you'll never use just to find the handful you need for your project. This is all done for you on Central, and the tools you use to search Central, including Nexus and m2eclipse, all benefit from this pre-made index file.

While this seems like such a simple idea, the Maven ecosystem hasn't had a standard way to search the repository for the majority of its history. For much of the last decade there was no reliable way to search for an artifact. In this post, I'm going to review this history and talk about Maven repository search and where we think search is headed. With the release of Nexus OSS 1.9, it is a good time to summarize the results of Sonatype's donation of the Nexus Indexer to the Apache Software Foundation.

In the beginning…

In the beginning there was ibiblio. That's not entirely true. When Jason van Zyl created the first Maven repository in 2002, there were a number of different servers and mirrors involved, but after a few weeks of migrating between Apache servers, the first incarnation of Central ended up at ibiblio.

If you were depending on the Maven repository back then, you have a good appreciation for how far we've come in just a few years. Back then, the bandwidth was terrible and the connections were iffy. It was touch-and-go in terms of both stability and process. You can still see some of the echoes of the initial effort today in the form of old group IDs. If you are wondering why projects like log4j and commons-lang exist as top-level group IDs, it is because the initial artifacts didn't follow the same strict standards for naming. In 2002 the repository was emerging, and it would take a few years for people to agree on standards and formats.

Maven's Structure Becomes a Standard

Within a year, people using the Maven repository had come to rely on it as an essential piece of development infrastructure. While Java had been around for almost a decade, no one had created a viable repository structure. There was no way to distribute artifacts between projects, and the Maven repository was created at a critical time in the development of open source Java. As Java really started to take off in the enterprise, as systems grew more complex, and as an explosion of open source Java libraries hit the scene from places like Apache, the Maven repository quickly became the established way to distribute artifacts. There was no going back.

Maven Search: The Dark Ages

Years passed, and the format of the repository changed between Maven 1 and Maven 2. As more and more projects started to use Maven or publish to the Maven repository, there was a need to create some sort of search interface to help developers locate artifacts in the repository.

Initial efforts at repository search were disconnected and relied on multiple, independent systems running proprietary analysis on the entire repository. (I know this myself because I took a stab at writing one in 2006.) To search the repository, you had to download the entire repository and run a series of regular jobs to grab changes and update an index. There was no standard search mechanism that would facilitate tool integration, and most people discovered Maven artifacts either by clicking around in a browser or by word-of-mouth.

Earlier Repository Managers and Search

Early repository managers contained independent implementations for search indexing. Early versions of Archiva had an independent index library based on Lucene, and Artifactory relies on a JCR store. While it was clear that repositories were going to play a key role in providing an easy way to search for artifacts by metadata and class name, these earlier approaches still required people maintaining repository managers to download the entire repository and run a time-consuming, CPU-intensive process just to create a searchable index.

This is where the Nexus Indexer started to come into play.

The Nexus Indexer

When Sonatype created a repository manager, we also aimed to create a standard and sustainable way to search for artifacts. The Nexus Indexer defines a standard format for repositories, but, most importantly, it defines a portable format that captures information about a repository. This format is what your repository manager downloads from a remote repository, and it is the reason why repository managers like Nexus and Archiva can allow you to browse the contents of a remote repository without downloading "the entire internet".

Sonatype's open source repository manager, Nexus, was the first repository manager to define a standard format for a repository index. We didn't just define a product with a search feature; we set out to define the standard. Using this model, servers like Maven Central and other popular open source forge repositories would periodically index repository contents and present clients with an index. Instead of downloading terabytes of data from the internet, your Maven client, your IDE, or your repository manager could download an optimized index containing all of the metadata about artifacts in the repository.

This innovation also had the effect of offloading the indexing work from your repository manager. If you want to search the repository, you don't have to set aside a few days for your own instance to crawl terabytes of data. Maven Central is indexed once, in a central location, and the world saves zillions of CPU cycles because of that fact.
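To make the "index once, search everywhere" idea concrete, here is a minimal sketch in Python. It is not the Maven Indexer format or API: the record fields, the JSON payload, and the function names (build_index, load_index, search) are all hypothetical, and the two artifacts are sample data. It only illustrates the division of labor described above, where the server publishes a compact metadata file and each client searches that file locally without ever fetching the artifacts themselves.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactRecord:
    """Metadata about one artifact; the artifact's bytes are never needed."""
    group_id: str
    artifact_id: str
    version: str
    class_names: tuple


def build_index(records):
    """Server side (run once, centrally): serialize metadata only."""
    return json.dumps([
        {"g": r.group_id, "a": r.artifact_id, "v": r.version,
         "c": list(r.class_names)}
        for r in records
    ])


def load_index(payload):
    """Client side: parse the downloaded index into searchable records."""
    return [
        ArtifactRecord(e["g"], e["a"], e["v"], tuple(e["c"]))
        for e in json.loads(payload)
    ]


def search(index, term):
    """Case-insensitive substring match on group ID, artifact ID, or class name."""
    term = term.lower()
    return [
        r for r in index
        if term in r.group_id.lower()
        or term in r.artifact_id.lower()
        or any(term in c.lower() for c in r.class_names)
    ]


# Sample data standing in for a repository crawl (illustrative only).
published = build_index([
    ArtifactRecord("log4j", "log4j", "1.2.16",
                   ("org.apache.log4j.Logger",)),
    ArtifactRecord("commons-lang", "commons-lang", "2.6",
                   ("org.apache.commons.lang.StringUtils",)),
])

# A client downloads `published` (a small text file) and searches it locally.
local_index = load_index(published)
hits = search(local_index, "StringUtils")
```

The point of the sketch is the shape of the workflow rather than the storage format: the expensive crawl happens in one place, and every tool that can read the published file gets search by coordinates or class name for the cost of one small download.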

The Nexus Indexer was an immediate success, and the index was generated on a weekly schedule on Maven Central. Sonatype carved out the Nexus Indexer as a separate component and released it under a very conservative, BSD-style license, doing all we could to make sure that everyone interacting with the repository could read this format. Very quickly all of the repository managers on the market could both read from and write to a Nexus index. For example, Archiva now relies on the Nexus Indexer, and will likely move to the Maven Indexer.

Donating Nexus Indexer to the Maven Project

As this index format became more widespread through the Maven ecosystem and more generalized for other languages and systems, Sonatype thought it made perfect sense to remove any direct association with Nexus. It was becoming a part of the foundational infrastructure for the community, and, in a decision that might make some executives' heads spin with disbelief, we donated it to the Maven community. We gave it away.

For something this low-level, this important to the community, it didn't make any sense for Sonatype to hold on to this resource. The Nexus Indexer is now called the Maven Indexer and the code behind this index is now a part of the Apache Maven project.

At Sonatype, we understand the role we play in supporting the universal infrastructure that enables the world to develop and collaborate. We also understand how critical it is that something as important as the index format be truly open: managed by an active and healthy community, free of difficult licensing questions, and unencumbered by patent concerns.

Next Up: Nexus OSS 1.9, Now with the Maven Index

To complete the circle of our successful transition of the Nexus Indexer to the Apache Maven project, we're announcing that the next version of Nexus will be one of the first projects to incorporate the newly minted Maven Indexer. Stay tuned.