What is an Artifact Repository?


April 3, 2009 By Tim O'Brien

Introduction

While many developers have adopted Maven as a build tool, most have yet to understand the importance of maintaining a repository manager both to proxy remote repositories and to manage and distribute software artifacts. This document defines repository and repository management, providing context for developers interested in learning how to use Sonatype’s Nexus to achieve a more efficient development cycle.

What is a Repository?

Maven developers are familiar with the concept of a repository: a collection of binary software artifacts and metadata stored in a defined directory structure which is used by clients such as Maven, Mercury, or Ivy to retrieve binaries during a build process. In the case of the Maven repository, the primary type of binary artifact is a JAR file containing Java bytecode, but there is no limit to what type of artifact can be stored in a Maven repository. For example, one could just as easily deploy documentation archives, source archives, Flash libraries and applications, or Ruby libraries to a Maven repository. A Maven repository provides a platform for the storage, retrieval, and management of binary software artifacts and metadata.

In Maven, every software artifact is described by an XML document called a Project Object Model (POM). This POM contains information that describes a project and lists a project’s dependencies – the binary software artifacts which a given component depends upon for successful compilation or execution. When Maven downloads a dependency from a repository, it also downloads that dependency’s POM. Given a dependency’s POM, Maven can then download any other libraries which are required by that dependency. The ability to automatically calculate a project’s dependencies and transitive dependencies is made possible by the standard and structure set by the Maven repository.
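To make the idea of transitive dependencies concrete, here is a minimal sketch of the traversal described above. It is not Maven's actual resolver: the POMS dictionary, its "name:version" keys, and the resolve function are hypothetical stand-ins for POM files in a repository, and the sketch ignores version conflicts, scopes, and exclusions.

```python
from collections import deque

# Hypothetical stand-in for a repository: each key is an artifact and each
# value is the list of dependencies declared in that artifact's POM.
POMS = {
    "app:1.0": ["web-lib:2.1", "util:1.4"],
    "web-lib:2.1": ["util:1.4", "json:0.9"],
    "util:1.4": [],
    "json:0.9": [],
}


def resolve(root, poms):
    """Return the direct and transitive dependencies of root, breadth-first."""
    resolved = []
    queue = deque(poms[root])
    while queue:
        dependency = queue.popleft()
        if dependency in resolved:
            continue  # already handled through another path
        resolved.append(dependency)
        # Reading the dependency's own POM reveals what it depends on in turn.
        queue.extend(poms.get(dependency, []))
    return resolved


print(resolve("app:1.0", POMS))
```

In this sketch, json:0.9 is never declared by app:1.0, yet it appears in the result because the POM for web-lib:2.1 lists it.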

Maven and other tools such as Ivy interact with a repository to search for binary software artifacts, model the projects they manage, and retrieve software artifacts on demand. When you download and install Maven without any customization, Maven will retrieve artifacts from the Central Maven repository, which serves millions of Maven users every single day. While you can configure Maven to retrieve binary software artifacts from a collection of mirrors, the best practice is to install a repository manager such as Nexus, which can proxy the Central repository and cache artifacts retrieved from a remote repository on a server in your own network. In addition to Central, there are a number of major organizations such as Red Hat, Sun Microsystems, and Codehaus which maintain separate repositories.

While this might seem like a simple, obvious mechanism for distributing artifacts, the Java platform existed for several years before the Maven project created a formal attempt at the first repository for Java artifacts. Until the advent of the Maven repository in 2002, a project’s dependencies were gathered in a manual, ad-hoc process and were often distributed with the source code for an open source project. As applications grew more and more complex, and as software teams developed a need for more complex dependency management capabilities for larger enterprise applications, Maven’s ability to automatically retrieve dependencies and model dependencies between components became an essential part of software development.

Release and Snapshot Repositories

A repository stores two types of artifacts: releases and snapshots. Release repositories are for stable, static release artifacts, and snapshot repositories are frequently updated repositories that store binary software artifacts from projects under constant development. While it is possible to create a repository which serves both release and snapshot artifacts, repositories are usually segmented into release or snapshot repositories serving different consumers and maintaining different standards and procedures for deploying artifacts. Much like the difference between a production network and a staging network, a release repository plays the role of a production network and a snapshot repository is more like a development or testing network. While there is a higher level of procedure and ceremony associated with deploying to a release repository, snapshot artifacts can be deployed and changed frequently without regard for stability and repeatability concerns.

Release Artifacts

A release artifact is an artifact which was created by a specific, versioned release. For example, consider the 1.2.0 release of the commons-lang library stored in the Central Maven repository. This release artifact, commons-lang-1.2.0.jar, and the associated POM, commons-lang-1.2.0.pom, are static objects which will never change in the Central Maven repository. Released artifacts are considered to be solid, stable, and perpetual in order to guarantee that builds which depend upon them are solid and repeatable over time. The released JAR artifact is associated with a PGP signature and with MD5 and SHA checksums, which can be used to verify both the authenticity and integrity of the binary software artifact.
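As an illustration of the integrity half of that check, the sketch below compares the bytes of a downloaded artifact with a published checksum using Python's standard hashlib module. The function names checksum and verify are hypothetical, and the sketch assumes the checksum file holds a hexadecimal digest as its first token. A matching checksum only shows that the file was not corrupted; establishing authenticity requires verifying the PGP signature, which is not shown here.

```python
import hashlib


def checksum(data: bytes, algorithm: str = "sha1") -> str:
    """Return the hexadecimal digest of data for the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def verify(data: bytes, published: str, algorithm: str = "sha1") -> bool:
    """Compare data against the contents of a published checksum file.

    Assumption: the digest is the first whitespace-separated token in the
    checksum file; anything after it (such as a file name) is ignored.
    """
    expected = published.split()[0].lower()
    return checksum(data, algorithm) == expected


artifact_bytes = b"example artifact contents"
published_sha1 = checksum(artifact_bytes, "sha1")  # stands in for a .sha1 file
print(verify(artifact_bytes, published_sha1))
print(verify(artifact_bytes + b"tampered", published_sha1))
```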

Snapshot Artifacts

Snapshot artifacts are artifacts generated during the development of a software project. A snapshot artifact has both a version number such as “1.3.0” or “1.3” and a timestamp in its name. For example, a snapshot artifact for commons-lang 1.3.0 might have the name commons-lang-1.3.0-20090314.182342-1.jar; the associated POM and the MD5 and SHA hashes would have similar names. To facilitate collaboration during the development of software components, Maven and other clients which consume snapshot artifacts know how to interrogate the metadata associated with a snapshot artifact and always retrieve the latest version of a snapshot dependency from a repository.
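The naming pattern in that example can be taken apart mechanically. The sketch below is a simplification: it assumes the name has the form artifact-version-timestamp-build.extension, exactly as in the example above, and it picks the newest snapshot by comparing file names, whereas a real client consults the repository's metadata. The names SNAPSHOT_NAME, parse_snapshot, and latest_snapshot are hypothetical.

```python
import re

# Assumed layout: <artifact>-<version>-<yyyymmdd.hhmmss>-<build>.<extension>
SNAPSHOT_NAME = re.compile(
    r"^(?P<artifact>.+)-(?P<version>\d+(?:\.\d+)*)"
    r"-(?P<timestamp>\d{8}\.\d{6})-(?P<build>\d+)\.(?P<ext>\w+)$"
)


def parse_snapshot(file_name):
    """Split a snapshot file name into its named parts."""
    match = SNAPSHOT_NAME.match(file_name)
    if match is None:
        raise ValueError(f"not a timestamped snapshot name: {file_name}")
    parts = match.groupdict()
    parts["build"] = int(parts["build"])
    return parts


def latest_snapshot(file_names):
    """Return the name with the newest timestamp, then the highest build."""
    def sort_key(name):
        parts = parse_snapshot(name)
        # Timestamps are fixed width, so comparing them as strings is safe.
        return (parts["timestamp"], parts["build"])
    return max(file_names, key=sort_key)


names = [
    "commons-lang-1.3.0-20090312.120000-1.jar",
    "commons-lang-1.3.0-20090315.091500-3.jar",
    "commons-lang-1.3.0-20090314.182342-2.jar",
]
print(parse_snapshot(names[2]))
print(latest_snapshot(names))
```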

Repository Coordinates

Repositories and tools like Maven know about a set of coordinates including the following components: groupId, artifactId, version, and packaging. This set of coordinates is often referred to as a GAV coordinate, which is short for “Group, Artifact, Version coordinate”. The GAV coordinate standard is the foundation for Maven’s ability to manage dependencies. The four elements of this coordinate are described below:

Group Identifier (groupId)

A group identifier groups a set of artifacts into a logical group. Groups are often designed to reflect the organization under which a particular software component is being produced. For example, software components being produced by the Maven project at the Apache Software Foundation are available under the groupId org.apache.maven.

Artifact Identifier (artifactId)

An artifact identifier names a software component. An artifact can represent an application or a library; for example, if you were creating a simple web application your project might have the artifactId “simple-webapp”, and if you were creating a simple library, your artifactId might be “simple-library”. The combination of groupId and artifactId must be unique for a project.

Version (version)

The version of a project follows the established convention of Major, Minor, and Point release versions. For example, if your simple-library artifact has a Major release version of 1, a minor release version of 2, and point release version of 3, your version would be 1.2.3. Versions can also have alphanumeric qualifiers which are often used to denote release status. An example of such a qualifier would be a version like “1.2.3-BETA” where BETA signals a stage of testing meaningful to consumers of a software component.
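A short sketch of how such a version string breaks down is shown below. It assumes at most three numeric components followed by an optional qualifier after the first hyphen; the function name parse_version is hypothetical, and this is not the comparison algorithm Maven itself uses.

```python
def parse_version(version):
    """Split a version such as "1.2.3-BETA" into (major, minor, point, qualifier)."""
    numbers, _, qualifier = version.partition("-")
    parts = [int(part) for part in numbers.split(".")]
    if len(parts) > 3:
        raise ValueError(f"expected at most three numeric parts: {version}")
    # A version such as "1.3" is treated as "1.3.0" in this sketch.
    parts += [0] * (3 - len(parts))
    major, minor, point = parts
    return major, minor, point, qualifier or None


print(parse_version("1.2.3"))
print(parse_version("1.2.3-BETA"))
```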

Packaging (packaging)

Maven was initially created to handle JAR files, but a Maven repository is completely agnostic when it comes to the type of artifact it is managing. Packaging can be anything that describes a binary software format, including ZIP, SWC, SWF, NAR, WAR, EAR, and SAR.

Addressing Resources in a Repository

Tools which are designed to interact with the Maven repository translate these coordinates into a URL which corresponds to a location in a Maven repository. If a tool such as Maven is looking for version 1.2.0 of the commons-lang JAR in the group org.apache.commons, this request is translated into:

.../org/apache/commons/commons-lang/1.2.0/commons-lang-1.2.0.jar
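The translation above can be sketched in a few lines. The function name gav_to_path is hypothetical, and the sketch assumes the file extension is the same as the packaging, which holds for a JAR but is a simplification in general. It returns only the path relative to the repository root, since the root URL depends on the repository being used.

```python
def gav_to_path(group_id, artifact_id, version, packaging="jar"):
    """Build the repository-relative path for a set of coordinates."""
    # Dots in the groupId become directory separators.
    group_path = group_id.replace(".", "/")
    # Assumption: the file extension matches the packaging.
    file_name = f"{artifact_id}-{version}.{packaging}"
    return f"{group_path}/{artifact_id}/{version}/{file_name}"


print(gav_to_path("org.apache.commons", "commons-lang", "1.2.0"))
```

Passing "pom" as the packaging yields the location of the associated POM in the same directory.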

Properties of the Central Maven Repository

The Central Maven repository contains almost 90,000 software artifacts occupying around 70 GB of disk space. You can look at Central as an example of how Maven repositories operate. Here are some of the properties of release repositories such as the Central Maven repository:

Artifact Metadata

All software artifacts added to Central require proper metadata, including a Project Object Model (POM) for each artifact which describes the artifact itself and any dependencies that software artifact might have.

Release Stability

Once published to the Central Maven repository, an artifact and the metadata describing that artifact never change. This property of release repositories guarantees that projects which depend on releases will be repeatable and stable over time. While new software artifacts are being published to Central every day, once an artifact is assigned a release number on Central, there is a strict policy against modifying the contents of a software artifact after a release.

Repository Mirrors

Central is a public resource, and it is currently used by the millions of developers who have adopted Maven and the tools that understand how to interact with the Maven repository structure. There are a series of mirrors for the Central repository which are constantly synchronized with Central. Users are encouraged to query Central for project metadata and cryptographic hashes, and to retrieve the actual software artifacts from one of Central’s many mirrors. Tools like Nexus are designed to retrieve metadata from Central and artifact binaries from mirrors.

Artifact Security

The Central Maven repository contains cryptographic hashes and PGP signatures which can be used to verify the authenticity and integrity of software artifacts served from Central or one of the many mirrors of Central.