Why namespacing matters in public open source repositories

February 10, 2021 By Brian Fox


Yesterday we saw the disclosure of a report showing how a security researcher was able to infiltrate 35+ name-brand companies, primarily via npm. Ironically, the mechanism behind the attack, which is being called namespace confusion or dependency confusion, is one that I'm quite familiar with. It has been at the heart of the contention over how we've managed the Maven Central repository for 16+ years, versus the users who push back on the standards and just want it to be "easy like npm."

Since I've written extensively about this problem both recently and in the past, I'm going to try something different and assemble a blog post like we build our software today. I'm going to build the case by stitching together parts of the narrative from the past so we can see how it continues to hold up.

Providing namespaces is really important

During JavaOne 2017, I had a hallway chat about some of the plans around the new module names that immediately set off alarm bells for me. I recognized that those plans had the potential either to fork the community à la Python 2 vs 3 or to toss away the well-understood Java classpath naming conventions from which Maven derived its own coordinate convention. I ended up getting involved in the JSR 376 Java Modules spec process to help avoid the mistakes I was observing in those other ecosystems. At that time, I wrote to the mailing lists and published a DZone post detailing some of the history and concerns:

Traditionally, the Java ecosystem has been very mature in terms of naming and namespacing. The reverse FQDN (fully qualified domain name) introduced into the Java package was a great choice to ensure classes don't conflict. Popular build tools such as Maven and nearly all those that followed built upon this key concept, with the introduction of "GroupId" also using the FQDN as part of the name to ensure the coordinates were properly namespaced.

We've seen some ecosystems diverge from this, leading to new challenges that ultimately had to be reversed. A great example can be seen in the "tragic mistake from npm creators," which was to launch without a namespace concept. Eventually, npm started running out of useful names and had to backtrack to introduce "scopes," which are really just namespaces. The real problem here is that the major change in namespace was baked in after several years of momentum without it. It's taken a long time for tooling and best practices to catch up to scopes and, in the interim, people have been left with a dual mode, a "some namespaced, some not namespaced" situation, that has created chaos.

Note: The fact that so much of the npm ecosystem is effectively not namespaced has actually created potential build time malware injection possibilities. If I know of a package in use by a company through log analysis, bug report analysis, etc., I could potentially go register the same name in the default repo with a very high semver and know that it’s very likely that this would be picked up over the intended, internally developed module because there’s no namespace.

Sadly, that last paragraph describes precisely the attack disclosed today. If only I had the foresight to go after bug bounties with it. By the way, it seems like I wasn't the only person concerned about this a long time ago. Here are a few other fortune tellers:

Dependency Confusion from 2016

And from Stack Overflow:

Python Namespace Confusion

Fortunately, the JSR spec lead took our concerns to heart and ultimately made changes that prevented the fork in backwards compatibility and recommended the reverse-DNS style of naming as the convention for the new Java Modules for Java 9+.
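To make the build-time injection scenario from the note earlier in this section concrete, here is a minimal sketch of a resolver that simply takes the highest version it can find across every registry it knows about. It is not how npm or any particular client is implemented. The registry contents, the package name acme-billing, and the functions naive_resolve and routed_resolve are all hypothetical, and versions are assumed to be plain MAJOR.MINOR.PATCH strings.

```python
from typing import Dict, List, Optional, Tuple

Version = Tuple[int, int, int]
Registries = Dict[str, Dict[str, List[str]]]


def parse_version(text: str) -> Version:
    """Parse a plain MAJOR.MINOR.PATCH string (assumption: no pre-release tags)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def naive_resolve(name: str, registries: Registries) -> Optional[Tuple[str, str]]:
    """Return (registry, version) for the highest version of `name` in any registry."""
    candidates = [
        (parse_version(version), registry, version)
        for registry, packages in registries.items()
        for version in packages.get(name, [])
    ]
    if not candidates:
        return None
    _, registry, version = max(candidates)
    return registry, version


def routed_resolve(
    name: str, registries: Registries, internal_prefixes: List[str]
) -> Optional[Tuple[str, str]]:
    """Resolve names under an owned prefix from the internal registry only."""
    if any(name.startswith(prefix) for prefix in internal_prefixes):
        return naive_resolve(name, {"internal": registries.get("internal", {})})
    return naive_resolve(name, registries)


# Hypothetical data: an internal package and a public one with the same name.
registries: Registries = {
    "internal": {"acme-billing": ["1.2.0", "1.3.1"]},
    "public": {"acme-billing": ["99.0.0"]},
}

print(naive_resolve("acme-billing", registries))               # the public 99.0.0 wins
print(routed_resolve("acme-billing", registries, ["acme-"]))   # stays internal
```

With no namespace to tell the two packages apart, the only signal the naive resolver has is the version number, which the attacker controls. The routed variant only works if the owned prefix really is reserved for the organization, which is the enforcement question the rest of this post is about.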

Enforcing namespaces in public repositories is even more important

Just last month, we had the first known insertion of intentionally malicious components into the Central Repository. However, because of the namespace inherent in Maven and the long-standing validation of those namespaces before people are allowed to publish, the impact was minimized. I described that aspect of the history as follows:

Sonatype's Maven Central Repository is home to over 6,000,000 open-source Java components commonly used by the developer community.  Each month, about 200,000 new component releases are added to the repository. In 2019, Maven Central served 226 billion download requests.

Unlike most other open source software component ecosystems, Maven is built upon a strong namespacing concept that requires that every artifact be addressed using (minimally) a three-part coordinate: Group ID : Artifact ID : Version. Group IDs follow the Java package convention, which is the reverse of the development team's domain name. For example, all Apache Software Foundation artifacts have org.apache as the start of their Group ID: org.apache.maven is Maven, org.apache.struts is Struts, etc.

When a new publisher comes along requesting access to publish to Central, the publisher must verify control of either the DNS domain behind the Group ID or the account/repo for coordinates derived from platforms such as GitHub. As a part of this screening, users are asked to verify their GitHub account before they are assigned a Group ID, such as 'com.github.codingandcoding,' as is the ID in this case.
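As a rough illustration of that rule, the sketch below maps a Group ID to the thing a publisher would have to prove control over. The function name verification_target is hypothetical, and the sketch assumes the owning domain is always the first two labels of the Group ID reversed, which does not hold for suffixes such as co.uk. How the proof itself is checked is out of scope here.

```python
from typing import Tuple


def verification_target(group_id: str) -> Tuple[str, str]:
    """Return (kind, value) describing what a publisher must prove control of.

    Assumption: the owning domain is the first two labels reversed,
    e.g. org.apache.maven -> apache.org.
    """
    labels = group_id.lower().split(".")
    if len(labels) < 2 or not all(labels):
        raise ValueError(f"not a reverse-domain Group ID: {group_id!r}")
    # Coordinates derived from GitHub are tied to an account, not a domain.
    if len(labels) >= 3 and labels[:2] == ["com", "github"]:
        return ("github-account", labels[2])
    return ("dns", ".".join(reversed(labels[:2])))


print(verification_target("org.apache.maven"))            # ('dns', 'apache.org')
print(verification_target("com.github.codingandcoding"))  # ('github-account', 'codingandcoding')
```

The point of the sketch is only that the Group ID itself tells any observer where to look for proof of ownership.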

When we see brandjacking occur in repositories without a namespace, you can see that it can be easy to trick users into using foo-bar when the legit project is actually fooBar or foo_bar. In the Maven case however, it becomes a bit harder given the GroupId. As seen in this example, the publisher created something called:

com.github.codingandcoding:maven-compiler-plugin 

which is clearly different from

org.apache.maven.plugins:maven-compiler-plugin.

Note: The gist here is that Maven created a proper element, "groupId," and that Maven and the Central Repository first encouraged, then enforced, that anything published here can be tied back to a DNS entry that you control. Without the proper namespace elements, this would be nearly impossible to enforce.
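The difference between a flat namespace and Maven-style coordinates in the example above can be sketched in a few lines. The helper names lookalike_key and parse_ga are hypothetical, and the normalization is only one plausible way to model how easily names are confused.

```python
import re
from typing import NamedTuple


def lookalike_key(name: str) -> str:
    """Collapse case and separators so foo-bar, fooBar and foo_bar compare equal."""
    return re.sub(r"[-_.]", "", name).lower()


class Coordinate(NamedTuple):
    group_id: str
    artifact_id: str


def parse_ga(text: str) -> Coordinate:
    """Parse a 'groupId:artifactId' string."""
    group_id, artifact_id = text.split(":")
    return Coordinate(group_id, artifact_id)


# Flat namespace: three distinct names that a hurried reader treats as one.
flat_names = ["foo-bar", "fooBar", "foo_bar"]
print({lookalike_key(name) for name in flat_names})  # a single key: {'foobar'}

# Namespaced: identical artifactId, but the groupId shows who published it.
official = parse_ga("org.apache.maven.plugins:maven-compiler-plugin")
imposter = parse_ga("com.github.codingandcoding:maven-compiler-plugin")
print(official.artifact_id == imposter.artifact_id)  # True
print(official.group_id == imposter.group_id)        # False
```

In the flat case the name is the only thing a consumer can inspect. In the namespaced case the Group ID carries a claim that the repository has already verified.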

Unfortunately, this is the exact situation we still find in RubyGems as well as PyPI. Additionally, while npm retroactively introduced "scopes" as a form of namespace, there is no enforcement of who can claim and publish to those coordinates. In fact, many of the components in today's disclosure were using clear namespaces, but nothing stopped the researcher from publishing them to the public repository anyway.

This conversation about validation is timely as well, given the impending shutdown of Bintray and JCenter. This repo was long marketed as making it easy to publish, answering the rallying cry of "just let me publish like npm." Many projects are now finding themselves having to grapple with the fact that they have used coordinates that they don't control. Worse, some have found themselves domain squatted:

Domain Squatting Bintray/JCenter

At a glance, it seems like a reasonable question for someone to ask: Why do I need to buy a domain simply to publish my components to Central? Well, the first answer is you don't. Subprojects of common shared campfires like GitHub are valid if you can prove commit access to the project.

However, if you truly want to use your own custom coordinates, then yes, you need to buy that domain. Fortunately, in 2021, most domains can be had for dollars a year.

As with many things, using DNS as our naming authority is imperfect. However, it is very likely still the best choice. Why?

  • It's decentralized: you can buy a domain in countless places and, importantly, it is verifiable by any observer. While we have no visibility into which user accounts on JCenter "own" given coordinates in their walled garden, anyone can look into DNS. You might even find they left behind the traces of their validation for others to still observe.
    • This provides the ability to move from one repository to another without the risk of losing your coordinates.
    • It also provides an element that safeguards against competing repos having clashing coordinates.
  • It is well understood and universally supported. It is the phone book of the internet after all.
  • There are proven mechanisms to prevent hijacks and to support orderly transfers from one entity to another.
  • It's consistent: as long as you control the domain and continue to publish the package, there is an ongoing proof of ownership.
  • Owning the domain may allow you to defend even more vigorously against brandjacking and cybersquatting via the ACPA.
  • Since reverse FQDNs are the standard for Java classpaths and now Java Modules, and Central is primarily about Java components, it’s a nice symmetry that you can expect the GroupId com.myproject to include Java Modules and Classes of the same com.myproject name.

Bringing this all together, hopefully you now understand why namespaces are so critical to sharing components, regardless of the language and ecosystem. The lack of enforced namespacing, though good from a barrier-to-entry perspective, has the downside of shifting risk and responsibility entirely downstream to an uninformed consumer. This is an acceptable tradeoff for some, but as software supply chain attacks continue to increase in severity and prevalence, we as an industry need to decide what we value more in our de facto critical infrastructure.

All of these experiences, added up, are why I'm calling on all public registries to follow a similar process of validation, whether or not they rely on DNS as the master record. Whatever the mechanism, something needs to be done.

In the words of a favorite Twitter meme, "This didn't age well:"

"npm is a mostly anarchic system. There is not sufficient need to impose namespace rules on everyone."

If you're wondering if you've been affected by a namespace attack, Sonatype has released a script on GitHub that users of Sonatype Nexus Repository can use to check if any of their private dependencies have the same names as existing squatted packages on the public npm, RubyGems, and PyPI repos.
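The idea behind such a check can be sketched without any network access. This is not Sonatype's script, only an illustration under stated assumptions: the function find_collisions, the package names, and the in-memory lists standing in for private and public registry listings are all hypothetical, and names are compared case-insensitively.

```python
from typing import Iterable, List


def find_collisions(private_names: Iterable[str], public_names: Iterable[str]) -> List[str]:
    """Return private package names that also exist publicly (case-insensitive)."""
    public = {name.lower() for name in public_names}
    return sorted(name for name in set(private_names) if name.lower() in public)


# Hypothetical listings; a real audit would pull these from the registries.
private_packages = ["acme-billing", "acme-auth", "acme-reports"]
public_packages = ["requests", "Acme-Billing", "left-pad"]

print(find_collisions(private_packages, public_packages))  # ['acme-billing']
```

Any name the check reports is a candidate for closer inspection: either someone has squatted it, or an unrelated public project happens to share it.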

If you are a Sonatype Nexus Repository user and want to understand how some of its features can help you limit ongoing exposure, take a look at this how-to blog post.

Finally, if you are interested in understanding how to identify the components that poison the well for all, we have good news: the Release Integrity feature of our Sonatype Nexus Advanced Development pack is for you.


Written by Brian Fox

Brian Fox is a software developer, innovator and entrepreneur. He is an active contributor within the open source development community, most prominently as a member of the Apache Software Foundation and former Chair of the Apache Maven project. As the CTO and co-founder of Sonatype, he is focused on building a platform for developers and DevOps professionals to build high-quality, secure applications with open source components.