Central Maven Repository Traffic: Using S3

Yesterday, in Central Maven Repository Traffic: Investigation and Analysis, I wrote about the analysis involved in tracking down the increasing load on Central. By identifying some misbehaving tools, we were able to reduce the traffic from a 98 Mbps average down to 60-80 Mbps. In this post, I discuss the next step toward a Central Maven Repository that can scale to meet the load generated by the millions of developers whose tools rely on it.

Where We Left Off...

As a refresher, here's what the load looked like at the end. To summarize: Central was experiencing load problems on httpd, so we moved to Nginx, which fixed the load problem but caused us to regularly saturate a 100 Mbps pipe. After a few weeks of investigation, we discovered that much of the traffic was due to a misconfigured product that was repeatedly downloading the Nexus index. We notified the responsible project and blocked access to the index until the problem was fixed. The end result of all that work was a Central Maven Repository that had fewer saturation events and operated in a consistent 60-80 Mbps range in the middle of a work week.

We still experienced significant amounts of traffic on Monday mornings as users fired up M2Eclipse and downloaded the updated weekly index. We tried shifting the index creation from Sunday night to Friday afternoon to help smooth out some traffic over the weekend, but ultimately this didn't help much. We ended up having to use QoS to throttle transfers of the index zip down to something like 30 kbps to keep the rest of the repository available during these loads. This wasn't a great solution, as it significantly increased the time it took users to get the index.

Considering Amazon S3

One of the ideas we had been pursuing was to use Amazon S3 to host Central. For those of you unfamiliar with S3, it is a cloud-based system for storing and serving data, and it is part of the Amazon Web Services family of products (EC2, S3, CloudFront). The nearly unlimited bandwidth and the option to use CloudFront to bring the data physically closer to users are definite bonuses. It was clear to us that we needed something that could scale indefinitely as we continue to see increased adoption of tools that rely on the Central Repository. The drawback with S3 is that we don't have the direct ability to monitor the traffic and uncover abuse or misconfigured tools that we have now. Even so, we all agreed that moving to something like S3 was the future.

Despite our frequent pleas to use repository managers and not scrape the repo, we continue to find people doing just that nearly every day. The bandwidth on S3 is not free, and opening it up without the ability to protect the bottom line from abusers is not a great idea. Once you start designing systems at Internet scale, the cost of bandwidth becomes hard to ignore.

Moving to "The Cloud": Amazon S3

We decided first to take baby steps and see if S3 could help us with a very specific problem: the index downloads. Instead of downloading multiple GB of repository data and creating an index locally, we think it makes more sense for people and tools to download a repository index once a week. This index weighs in at 30 MB, and while that might not seem very large at first glance, multiply 30 MB by 50,000 downloads and you'll quickly start to serve terabytes of data. This is exactly what was happening, and it seemed like an easy target to offload to Amazon S3.
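As a quick back-of-the-envelope check of that multiplication, here is a minimal sketch. The 30 MB index size and the 50,000 downloads are the figures quoted above; the use of decimal units (1 TB = 1,000,000 MB) and the function name are assumptions made for illustration.

```python
# Back-of-the-envelope estimate of index download traffic.
INDEX_SIZE_MB = 30       # index size quoted in the text
DOWNLOADS = 50_000       # download count quoted in the text
MB_PER_TB = 1_000_000    # assumption: decimal units (1 TB = 1,000,000 MB)


def total_transfer_tb(size_mb: float, downloads: int) -> float:
    """Return the total data served, in terabytes."""
    return size_mb * downloads / MB_PER_TB


if __name__ == "__main__":
    # 30 MB * 50,000 downloads = 1,500,000 MB
    print(f"{total_transfer_tb(INDEX_SIZE_MB, DOWNLOADS):.1f} TB")  # prints "1.5 TB"
```

So a single round of index downloads at that volume is already on the order of 1.5 TB.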

We found a handy Ruby script called S3Sync, which we use to synchronize the maven2/.index folder over to a "repo1.maven.org" bucket on S3. Nginx is then configured to send temporary redirects (302) for all /.index/ requests over to S3. By doing this, we are still able to manage traffic to the index, yet we offload the bulk of the data transfer to the S3 network.
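The redirect rule itself lives in the Nginx configuration, which isn't reproduced here. To illustrate the decision it makes, here is a rough Python sketch of the same logic. The S3 base URL, the object key layout, and the file names are hypothetical placeholders, not the real values.

```python
from typing import Optional, Tuple

# Hypothetical placeholder: the real S3 URL is not shown here and may change.
S3_BASE_URL = "https://s3.example.invalid/repo1.maven.org"
INDEX_MARKER = "/.index/"


def redirect_for(path: str) -> Optional[Tuple[int, str]]:
    """Return (302, location) for index requests, or None to serve locally."""
    pos = path.find(INDEX_MARKER)
    if pos == -1:
        return None  # not an index request: Central serves it directly
    # Assumption: the bucket mirrors the local layout from /.index/ onward.
    return 302, S3_BASE_URL + path[pos:]


if __name__ == "__main__":
    print(redirect_for("/maven2/.index/index.zip"))
    print(redirect_for("/maven2/junit/junit/4.5/junit-4.5.jar"))
```

Because the redirect is temporary (302) and issued by Central, clients keep asking Central first, which is what lets us keep control over where index traffic goes.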

So how did it work? Unbelievably well. Take a look at the week prior to the shift to S3 and the week after it:

Take a look at the weekly traffic before the switch to S3, keeping in mind that the weekly traffic numbers mask the periodic 100 Mbps saturation that we were seeing in the hourly graphs. A new index was published, and we saw a spike of download activity during the morning of December 1st. After that, our weekly traffic gradually diminished to a background level of between 40 Mbps and 80 Mbps on an average weekday. Again, note that even those good days had periods of complete saturation. After blocking the offending tools we saw an improvement, but we wanted to offload the bulk of our index downloads to S3 to gain further improvements.

Now, take a look at the weekly traffic graph after moving the index to S3. Unless you note the difference in Y-axis scale, this might not seem as impressive. The number we're most interested in decreasing is the 95th percentile value: the bandwidth level that 95% of our 5-minute average samples fall at or below. If our 95th percentile is very close to 100 Mbps, it means that we're likely to saturate the 100 Mbps pipe quite often. If our 95th percentile is down around 20 Mbps, we're much more likely to have a stable and available repository. After moving the index to S3, we saw a roughly 8x decrease in the 95th percentile, from 98.7 Mbps to 12.3 Mbps. We went from a weekly bandwidth average of 49.28 Mbps to 8.24 Mbps, and our total transfer for the week went from 3.81 TB to 629.7 GB.
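For readers unfamiliar with the 95th percentile, here is a minimal sketch of one common way to compute it, the nearest-rank method. This is an illustration only: our graphing tool may use a different method (for example, interpolation), and the sample values below are made up.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Return the smallest sample that at least pct% of samples are <= to."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up 5-minute averages in Mbps, for illustration only:
    # mostly quiet, with one busy interval and one saturated interval.
    samples = [10.0] * 18 + [60.0, 100.0]
    print(percentile_nearest_rank(samples, 95))  # prints 60.0
```

In this made-up data the single saturated interval is excluded by the 95th percentile, which is why a value pinned near 100 Mbps signals frequent saturation rather than an occasional spike.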

Before moving the index to S3, we were suffering from slow response times and an index download that took a few minutes during peak traffic. After we moved the index to S3, response times improved by 300%, and the index download now takes a few seconds to complete. Just moving the index to S3 yielded dramatic improvements for the Central Maven Repository.

The Result: Greater Speed, Higher Availability

In the two weeks since we moved the index to S3, it has served over 4 TB of data and 730,000 requests! Even though the bandwidth bill for the S3 service is significant, we estimate it to be half of what we save on the Central connection traffic. In other words, we've increased availability and reduced the overhead costs associated with the Central Maven Repository.

A note of caution: to protect the system from abuse, we may have to change the URLs on S3 from time to time, so don't point directly at the S3 URL. Continue to request the data from Central and you'll be OK.

We won't stop here in finding ways to optimize the repository experience for users. One thing that will be rolled out shortly is Incremental Index support. This will enable tools that use the Nexus Indexer API to grab only the chunks of the index that have changed, which should again have a significant impact on the amount of traffic. We also continue to investigate the possibility of leveraging the cloud to host the entire repository, so stay tuned.

Note: If you liked the screenshots used in these past two entries, check out Jing Project. Jing is a great free, easy-to-use tool for capturing screen images or videos and marking them up. Both OS X and Windows versions are available.

Written by Brian Fox
