This is a post in response to [Thomas Hallgren’s post] about Maven and P2 repositories. Thomas doesn’t seem to know about the indices that are being produced from Maven repositories today and how they are currently used. I thought I would share some information about the Maven ecosystem for those who aren’t familiar.
Maven Central has had an index for over two years. The index is created by Nexus-based technology, either the stand-alone tool or the server-side application, and has been integrated into tools like NetBeans, IntelliJ, M2Eclipse, Archiva, Artifactory and, obviously, Nexus. It does get a little hard to track all those artifacts, and over time we’ve adjusted how the index is produced, driven by the use cases we’ve encountered. Here are a couple of use cases that we’ve run across.
1) When Maven users are trying to create new projects, we use a rapid prototyping system called Archetype. In a nutshell, an Archetype is a Maven project in the form of a set of Velocity templates stored in a JAR. The JAR is deployed into a Maven repository and made available to users for quickly starting new projects. How do users find the Archetypes? Inside M2Eclipse we query the Nexus index for all artifacts in the repository that have a Maven packaging of “maven-archetype”. What comes back is a list of Archetypes that the user can select from to create a new project.
2) When users type in a class name or an import statement in M2Eclipse, the JAR that contains that class may not be present on the local system. We try to make it easy to find the required JAR by providing a way to search through the connected Maven repositories for the JARs that contain the given class.
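To make the two use cases above a little more concrete, here is a minimal sketch in Python of what the queries amount to. This is not the Nexus indexer API: the `IndexedArtifact` record, the two `find_*` functions, and the sample coordinates are all hypothetical, and a real index would be backed by something like Lucene rather than a list scan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedArtifact:
    """One hypothetical index entry: coordinates, packaging, contained classes."""
    group_id: str
    artifact_id: str
    version: str
    packaging: str
    class_names: frozenset = frozenset()


def find_by_packaging(index, packaging):
    """Use case 1: list every artifact with the given Maven packaging."""
    return [a for a in index if a.packaging == packaging]


def find_by_class_name(index, name):
    """Use case 2: find artifacts containing a class.

    Accepts either a simple name ("Widget") or a fully qualified one
    ("com.example.ui.Widget").
    """
    return [
        a for a in index
        if any(c == name or c.rsplit(".", 1)[-1] == name for c in a.class_names)
    ]


# Made-up sample data, for illustration only.
SAMPLE_INDEX = [
    IndexedArtifact("com.example", "webapp-archetype", "1.0", "maven-archetype"),
    IndexedArtifact("com.example", "ui-lib", "2.1", "jar",
                    frozenset({"com.example.ui.Widget", "com.example.ui.Panel"})),
    IndexedArtifact("com.example", "core-lib", "2.1", "jar",
                    frozenset({"com.example.core.Engine"})),
]

archetypes = find_by_packaging(SAMPLE_INDEX, "maven-archetype")
widget_jars = find_by_class_name(SAMPLE_INDEX, "Widget")
```

In this sketch `archetypes` holds the single archetype entry and `widget_jars` holds the `ui-lib` entry, which is the list a tool would then present to the user.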
For the indices to be truly useful you also have to allow arbitrary information to be submitted to the index and then queried against. For example: I want all artifacts produced by Jason on Mondays over the last year which are API compatible with a given GAV, have 80% or greater code coverage, and were submitted after 1pm. Brian knows that I wake up from my nap at 1pm and after that is when I’m at my best. Ad hoc querying is very powerful and that power has not been lost on us.
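As a rough illustration of ad hoc querying over arbitrary attributes, the sketch below treats each index record as a bag of fields and a query as a set of predicates that must all hold. Everything here is an assumption made for the example: the field names (`author`, `coverage`, `submitted`), the sample records, and the `query` helper. The API-compatibility and "last year" conditions from the example above are left out to keep it short.

```python
from datetime import datetime


def query(records, *predicates):
    """Return the records for which every predicate is true."""
    return [r for r in records if all(p(r) for p in predicates)]


# Hypothetical records; 2009-03-02 was a Monday, 2009-03-03 a Tuesday.
RECORDS = [
    {"id": "a1", "author": "jason", "coverage": 0.85, "submitted": datetime(2009, 3, 2, 14, 30)},
    {"id": "a2", "author": "jason", "coverage": 0.85, "submitted": datetime(2009, 3, 2, 9, 0)},
    {"id": "a3", "author": "jason", "coverage": 0.60, "submitted": datetime(2009, 3, 2, 15, 0)},
    {"id": "a4", "author": "brian", "coverage": 0.95, "submitted": datetime(2009, 3, 2, 15, 0)},
    {"id": "a5", "author": "jason", "coverage": 0.90, "submitted": datetime(2009, 3, 3, 15, 0)},
]

matches = query(
    RECORDS,
    lambda r: r["author"] == "jason",
    lambda r: r["submitted"].weekday() == 0,   # Monday
    lambda r: r["coverage"] >= 0.80,
    lambda r: r["submitted"].hour >= 13,       # 1pm or later
)
```

Only record `a1` satisfies all four predicates. The point of the design is that new attributes can be added to the records without changing the query mechanism.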
Federated searching? That has also been done in Nexus for a long time. We have two implementations of the index which support federation: one using Lucene and one using RDF. Neither is perfect, but the flaws are not terminal, and both have existing query languages, which means we don’t have to invent another one. We currently use this to provide a unified searching mechanism across all the Maven repositories in the world that wish to participate via [http://repository.sonatype.org]. We not only group the searching across repositories but we can also order the search.
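A grouped, ordered search can be sketched roughly as follows. This is one possible reading of "ordering", assumed for the example: repositories are consulted in a configured order, and when the same coordinates appear in more than one repository, the earliest repository wins. The `federated_search` function, the repository names, and the record layout are hypothetical and do not reflect how Nexus implements it.

```python
def federated_search(repositories, predicate):
    """Search several repository indices as if they were one.

    repositories: ordered list of (repository_name, list_of_artifact_dicts).
    Returns (repository_name, (group, artifact, version)) pairs, keeping only
    the first repository in which each set of coordinates is found.
    """
    seen = set()
    results = []
    for repo_name, index in repositories:
        for artifact in index:
            gav = (artifact["g"], artifact["a"], artifact["v"])
            if predicate(artifact) and gav not in seen:
                seen.add(gav)
                results.append((repo_name, gav))
    return results


# Made-up repositories; "internal" is listed first, so it takes precedence.
REPOSITORIES = [
    ("internal", [{"g": "com.example", "a": "core-lib", "v": "2.1"}]),
    ("public", [
        {"g": "com.example", "a": "core-lib", "v": "2.1"},
        {"g": "com.example", "a": "ui-lib", "v": "2.1"},
    ]),
]

hits = federated_search(REPOSITORIES, lambda art: art["g"] == "com.example")
```

Here `core-lib` is reported once, from `internal`, and `ui-lib` is reported from `public`.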
We also take into account other useful pieces of information in our indices, like the presence of Javadoc JARs, source JARs, and PGP signatures. This makes it very nice in tooling to be able to dynamically search for these and pull them down for users. We also index class file information, which allows us to do some pretty cool things, like searching for method signatures or other class information. We’ve been doing this for years as well.
What happens when these indices get really big? Well, we’ve run across that as well, and you have to deal with the incremental publication and consumption of indices because you can’t have a million folks downloading the whole index every time you update it. We learned this one the hard way. The index of Maven Central is actually the most requested artifact on Maven Central. We serve it out of S3 now.
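The idea behind incremental publication and consumption can be sketched like this: the publisher records each update as a numbered chunk of changes, and a consumer remembers the last chunk it applied and fetches only the newer ones. This is a toy model under stated assumptions, not the actual Nexus index format; the `IndexPublisher` and `IndexConsumer` classes and their methods are hypothetical.

```python
class IndexPublisher:
    """Keeps the full index plus an ordered list of incremental chunks."""

    def __init__(self):
        self.full = {}     # key -> record
        self.chunks = []   # each chunk: {key: record, or None for a deletion}

    def publish(self, changes):
        """Apply a batch of changes and record it as a new chunk."""
        for key, record in changes.items():
            if record is None:
                self.full.pop(key, None)
            else:
                self.full[key] = record
        self.chunks.append(dict(changes))
        return len(self.chunks)  # number of the chunk just published

    def chunks_since(self, last_applied):
        """Return only the chunks a consumer has not seen yet."""
        return self.chunks[last_applied:]


class IndexConsumer:
    """Tracks how far it has caught up and applies only newer chunks."""

    def __init__(self):
        self.local = {}
        self.last_applied = 0

    def update(self, publisher):
        new_chunks = publisher.chunks_since(self.last_applied)
        for changes in new_chunks:
            for key, record in changes.items():
                if record is None:
                    self.local.pop(key, None)
                else:
                    self.local[key] = record
        self.last_applied += len(new_chunks)
        return len(new_chunks)  # how many chunks were downloaded


publisher = IndexPublisher()
consumer = IndexConsumer()

publisher.publish({"com.example:core-lib:2.0": {"packaging": "jar"}})
publisher.publish({"com.example:core-lib:2.1": {"packaging": "jar"}})
first_sync = consumer.update(publisher)    # fetches both chunks

publisher.publish({"com.example:core-lib:2.0": None})  # artifact removed
second_sync = consumer.update(publisher)   # fetches only the new chunk
```

After the second sync the consumer's local copy matches the publisher's full index, having transferred three small chunks rather than the whole index twice.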
P2 is not fast enough when dealing with large repositories. We know this because we needed to make modifications for it to work efficiently in Tycho, which, as far as I know, is the first build system to publish and consume bundles from remote P2 repositories. Pulling all the metadata into memory doesn’t work so well when you dynamically aggregate the metadata of many P2 repositories.
So there is certainly a lot of work to do, but we’ve got the interoperability between P2 and Maven repositories covered. We also have the interoperability of P2, Maven, and OBR repositories covered because, of course, no system can live in a vacuum, even if some of its developers might like it to.